Data describing human interactions often suffer from incomplete sampling of the underlying population.
As a consequence, the study of contagion processes using data-driven models can lead to
a severe underestimation of the epidemic risk. Here we present a systematic method to alleviate 
this issue and obtain a better estimation of the risk in the context of epidemic models informed by 
high-resolution time-resolved contact data. We consider several such data sets collected in various 
contexts and perform controlled resampling experiments. We show how the statistical information 
contained in the resampled data can be used to build a series of surrogate versions of the unknown 
contacts. We simulate epidemic processes on the resulting reconstructed data sets and show that 
it is possible to obtain good estimates of the outcome of simulations performed using the complete 
data set. We discuss limitations and potential improvements of our method. 


Human interactions play an important role in determining the potential
transmission routes of infectious diseases and other contagion phenomena [1].
Their measurement and characterisation thus represent an invaluable
contribution to the study of transmissible diseases [2]. While surveys and
diaries in which volunteer participants record their encounters have provided
crucial insights (although recent investigations have examined the memory
biases inherent in self-reporting procedures), new approaches have recently
emerged to measure contact patterns between individuals with high resolution,
using wearable sensors that can detect the proximity of other similar devices.
The resulting measuring infrastructures register contacts specifically within
the closed population formed by the participants wearing sensors, with
typically high spatial and temporal resolutions. In recent years, several data
gathering efforts have used such methods to obtain, analyse and publish data
sets describing the contact patterns between individuals in various contexts
in the form of temporal networks: nodes represent individuals and, at each
time step, a link is drawn between pairs of individuals who are in contact
[25]. Such data have been used to inform models of epidemic spreading
phenomena used to evaluate epidemic risks and mitigation strategies in
specific, size-limited contexts such as schools or hospitals
[14, 19, 20, 22, 26-32], finding in particular outcomes consistent with
observed outbreak data or providing evidence of links between specific
contacts and transmission events.

Despite the relevance and interest of such detailed data sets, as illustrated
by these recent investigations, they suffer from the intrinsic limitation of
the data gathering method: contacts are registered only between participants
wearing sensors. Contacts with and between individuals who do not wear sensors
are thus missed. In other words, since in most cases not all individuals agree
to participate by wearing sensors, many data sets obtained by such techniques
suffer from population sampling, despite efforts to maximise participation
through e.g. scientific engagement of participants [24, 33]. Hence, the
collected data only contain information on contacts occurring among a fraction
of the population under study.

Population sampling is well known to affect the properties of static networks
[34-36]: various statistical properties and mixing patterns of the contact
network of a fraction of the population of interest may differ from those of
the whole population, even if the sampling is uniform [37-40], and several
works have focused on inferring network statistics from the knowledge of
incomplete network data [39, 41-44]. Both structural and temporal properties
of time-varying networks might also be affected by missing data effects.

In addition, a crucial though little studied consequence of such missing data
is that simulations of dynamical processes in data-driven models can be
affected if incomplete data are used [38, 39, 45]. For instance, in
simulations of epidemic spreading, excluded nodes are by definition
unreachable and thus equivalent to immunised nodes. Due to herd vaccination
effects, the outcome of simulations of epidemic models on sampled networks is
thus expected to be underestimated with respect to simulations on the whole
network. (We note, however, that in the different context of transportation
networks, it was found in [45] that the inclusion of the most important
transportation nodes can be sufficient to describe the global worldwide spread
of influenza-like illnesses, at least in terms of times of arrival of the
spread in various cities.) How to estimate the outcome of dynamical processes
on contact networks using incomplete data remains an open question.

Here we make progress on this issue for incompletely sampled data describing
networks of human face-to-face interactions, collected by infrastructures
based on sensors, under the assumption that the population participating in
the data collection is a uniform random sample of the whole population of
interest. (We therefore do not address here the issue of non-uniform sampling
of contacts that may result from other measurement methods such as diaries or
surveys.) We proceed through resampling experiments on empirical data sets in
which we exclude uniformly at random a fraction of the individuals (nodes of
the contact network). We measure how relevant network statistics vary under
such uniform resampling and confirm that, although some crucial properties are
stable, numerical simulations of spreading processes performed using
incomplete data lead to strong underestimations of the epidemic risk. Our goal
and main contribution is then to put forward and compare a hierarchy of
systematic methods to provide better estimates of the outcome of models of
epidemic spread in the whole population under study. To this aim, we do not
try to infer the true sequence of missing contacts. Instead, the methods we
present consist in the construction of surrogate contact sequences for the
excluded nodes, using only structural and temporal information available in
the resampled contact data. We perform simulations of spreading processes on
the reconstructed data sets, obtained by the union of the resampled and
surrogate contacts, and investigate how their outcomes vary depending on the
amount of information included in the reconstruction method. We show that it
is possible to obtain outcomes close to the results obtained on the complete
data set, while, as mentioned above, using only the incomplete data severely
underestimates the epidemic risk. We show the efficiency of our procedure
using three data sets collected in widely different contexts and
representative of very different population structures found in day-to-day
life: a scientific conference, a high school and a workplace. We finally
discuss the limitations of our method in terms of sampling range, model
parameters and population sizes.


RESULTS 

Data and Methodology 

We consider data sets describing contacts between individuals, collected by
the SocioPatterns collaboration (http://www.sociopatterns.org) in three
different settings: a workplace (office building, InVS) [46], a high school
(Thiers13) [24] and a scientific conference (SFHH). These data correspond to
the close face-to-face proximity of individuals equipped with wearable
sensors, at a temporal resolution of 20 seconds [16]. Table I summarises the
characteristics of each data set. The contact data are represented by temporal
networks, in which nodes represent the participating individuals and a link
between two nodes $i$ and $j$ at time $t$ indicates that the two corresponding
persons were in contact at that time. These three data sets were chosen as
representative of different types of day-to-day contexts and of different
contact network structures: the SFHH data correspond to a rather homogeneous
contact network; the InVS and Thiers13 populations were instead structured in
departments and classes, respectively. Moreover, high school classes
(Thiers13) are of similar sizes while the InVS department sizes are unequal.
Finally, the high school contact patterns (Thiers13) are constrained by strict
and repetitive school schedules, while contacts in offices are less regular
across days.

To quantify how the incompleteness of data, assumed to stem from a uniformly
random participation of individuals in the data collection, affects the
outcome of simulations of dynamical processes, we consider the available data
as ground truth and perform population resampling experiments by removing a
fraction $f$ of the nodes uniformly at random. (Note that the full data sets
are also samples of all the contacts that occurred in the populations, as the
participation rate was lower than 100% in each case. In the Thiers13 case,
however, the participation rate was quite high.) We then simulate on the
resampled data the paradigmatic Susceptible-Infectious-Recovered (SIR) and
Susceptible-Infectious-Susceptible (SIS) models of epidemic propagation. In
these models, a susceptible (S) node becomes infectious (I) at rate $\beta$
when in contact with an infectious node. Infectious nodes recover
spontaneously at rate $\mu$. In the SIR model, nodes then enter an immune
recovered (R) state, while in the SIS model, nodes become susceptible again
and can be reinfected. The quantities of interest are, for the SIR model, the
distribution of epidemic sizes, defined as the final fraction of recovered
nodes, and, for the SIS model, the average fraction of infectious nodes
$i_\infty$ in the stationary state. We also calculate for the SIR model the
fraction of epidemics that infect more than 20% of the population and the
average size of these epidemics. For the SIS model, we determine the epidemic
threshold $\beta_c$ for different values of $\mu$: it corresponds to the value
of $\beta$ that separates an epidemic-free state ($i_\infty = 0$) for
$\beta < \beta_c$ from an endemic state ($i_\infty > 0$) for
$\beta > \beta_c$, and is thus an important indicator of the epidemic risk. We
refer to the Methods section for further details on the simulations.
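
As an illustration of the SIR dynamics described above, the following minimal
Python sketch runs one realisation on a list of time-stamped contacts. It is
not the implementation used for the results reported here (the Methods section
describes that one). The function name `simulate_sir`, the `(t, i, j)` contact
format, the per-step probabilities $\beta \Delta t$ and $\mu \Delta t$, the
stopping rule at the end of the data and the counting of still-infectious
nodes in the final size are all assumptions made for the example.

```python
import random
from collections import defaultdict


def simulate_sir(contacts, nodes, beta, mu, dt=20, seed_node=None, rng=None):
    """Run one SIR realisation on a temporal contact list.

    contacts: iterable of (t, i, j), with t an integer multiple of dt
              (hypothetical input format).
    Returns the fraction of nodes that were ever infected.
    """
    rng = rng or random.Random()
    by_time = defaultdict(list)
    for t, i, j in contacts:
        by_time[t].append((i, j))

    state = {n: "S" for n in nodes}
    if seed_node is None:
        seed_node = rng.choice(sorted(state))
    state[seed_node] = "I"

    # Per-step probabilities; this approximation assumes beta*dt and mu*dt <= 1.
    p_inf, p_rec = beta * dt, mu * dt
    times = sorted(by_time)
    if times:
        for t in range(times[0], times[-1] + dt, dt):
            newly_infected = set()
            for i, j in by_time.get(t, ()):
                for src, dst in ((i, j), (j, i)):
                    if (state[src] == "I" and state[dst] == "S"
                            and rng.random() < p_inf):
                        newly_infected.add(dst)
            # Recoveries are drawn before the new infections are applied,
            # so a node cannot be infected and recover in the same step.
            for n in list(state):
                if state[n] == "I" and rng.random() < p_rec:
                    state[n] = "R"
            for n in newly_infected:
                state[n] = "I"
            if "I" not in state.values():
                break
    return sum(s != "S" for s in state.values()) / len(state)
```

Repeating such runs over many seeds and random number streams gives an
empirical distribution of epidemic sizes, from which the fraction of outbreaks
above a given size can be read off.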

We then present several methods for constructing surrogate data using only
information contained in the resampled data. We compare for each data set the
outcomes of simulations performed on the whole data set, on resampled data
sets with a varying fraction of nodes removed, $f$, and on the reconstructed
data sets built using these various methods.


Uniformly resampled contact networks 

Missing data are known to affect the various properties of contact networks in
different ways. In particular, the number of neighbours (degree) of a node
decreases as the fraction $f$ of removed nodes increases, since removing nodes
also removes links to these nodes. Under the hypothesis of uniform sampling,
the average degree $\langle k \rangle$ becomes $(1-f)\langle k \rangle$ for
the resampled network [47]. As a result, the density of the resampled
aggregated contact network, defined as the number of links divided by the
total number of possible links between the nodes, does not depend on $f$. The
same reasoning applies to the density $\rho_{AB}$ of links between groups of
nodes $A$ and $B$, defined as the number of links $E_{AB}$ between nodes of
group $A$ and nodes of group $B$, normalised by the maximum possible number of
such links, $n_A n_B$, where $n_A$ is the number of nodes of group $A$ (for
$A = B$, the maximum possible number of links is $n_A(n_A-1)/2$): both the
expected number of neighbours of group $B$ for nodes of group $A$ (given by
$E_{AB}/n_A$) and the number $n_B$ of nodes in group $B$ are indeed reduced by
a factor $(1-f)$, so that $\rho_{AB}$ remains constant. This means that the
link density contact matrix, which gathers these densities and gives a measure
of the interaction between groups (here classes or departments), is stable
under uniform resampling. We illustrate these results on our empirical data
sets in supplementary figures 1, 2, 4 and 5. Table II and supplementary figure
2 show in particular that the similarities between the original and resampled
matrices are high for all data sets (see supplementary figures 4-5 for the
contact matrices themselves).
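
The stability of the link density under uniform resampling can be checked
numerically. The sketch below is only an illustration on a synthetic random
network, not on the empirical data sets: the helper names `density` and
`resample`, the toy network and the choice $f = 0.3$ are assumptions
introduced for the example.

```python
import random


def density(nodes, links):
    """Number of links divided by the number of possible node pairs."""
    n = len(nodes)
    return len(links) / (n * (n - 1) / 2)


def resample(nodes, links, f, rng):
    """Remove a fraction f of the nodes uniformly at random, with their links."""
    n_keep = round((1 - f) * len(nodes))
    kept = set(rng.sample(sorted(nodes), n_keep))
    kept_links = {(i, j) for i, j in links if i in kept and j in kept}
    return kept, kept_links


# Toy aggregated network (an assumption for the demo, not one of the data sets).
rng = random.Random(0)
nodes = list(range(200))
links = {(i, j) for i in nodes for j in nodes if i < j and rng.random() < 0.1}

full_density = density(nodes, links)
resampled = [density(*resample(nodes, links, 0.3, rng)) for _ in range(50)]
mean_resampled = sum(resampled) / len(resampled)
print(f"full: {full_density:.3f}  resampled (f = 0.3): {mean_resampled:.3f}")
```

Individual resamplings fluctuate around the full-network value, which is why
the comparison is made on an average over many realisations.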

Finally, the temporal statistics of the contact network are not affected by
population sampling, as already noted in [16] for other data sets: the
distributions of contact and inter-contact durations (the inter-contact
durations are the times between consecutive contacts on a link), of the number
of contacts per link and of cumulated contact durations (i.e., of the link
weights in the aggregated network) do not change when the network is sampled
uniformly (supplementary figure 1). In the case of structured populations, an
interesting property is moreover illustrated in supplementary figures 6-7:
although the distributions of contact durations occurring between members of
the same group or between individuals belonging to different groups are
indistinguishable, this is not the case for the distributions of the numbers
of contacts per link nor, as a consequence, for the distributions of cumulated
contact durations. In fact, both cumulated contact durations and numbers of
contacts per link are more broadly distributed for links joining members of
the same group. The figures show that this property is stable under uniform
resampling.

Despite the robustness of these properties, the outcome of simulations of
epidemic spread is strongly affected by the resampling. As Fig. 1 illustrates
for instance, the probability of large outbreaks in the SIR model decreases
strongly as $f$ increases and even vanishes at large $f$. As mentioned above,
such a result is expected, since the removed nodes act as if they were
immunised: sampling hinders the propagation in simulations by removing
transmission routes between the remaining nodes. As a consequence, the
prevalence and the final size of the outbreaks are systematically
underestimated by simulations of the SIR model on the resampled network with
respect to simulations on the whole data set (for the SIS model, the epidemic
threshold is overestimated): resampling leads overall to a systematic
underestimation of the epidemic risk, and Fig. 1 illustrates the extent of
this underestimation for the data at hand.


Estimation of epidemic sizes through simulations on 
reconstructed temporal networks 

We now present a series of methods to improve the estimation of the epidemic
risk in simulations of epidemic spread on temporal network data sets in which
nodes (individuals) are missing uniformly at random. Note that we do not
address here the problem of link prediction [48], as our aim is not to infer
the missing contacts. The hierarchy of methods we put forward uses increasing
amounts of information corresponding to increasing amounts of detail on the
group and temporal structure of the contact patterns, as measured in the
resampled network. We moreover assume that the timelines of scheduled activity
are known (i.e., nights and weekends, during which no contact occurs).

For each data set, considered as ground truth, we create resampled data sets
by removing at random a fraction $f$ of the $N$ nodes. We then measure on each
resampled data set a series of statistics of the resulting contact network and
construct stochastic, surrogate versions of the missing part of the network by
creating for each missing node a surrogate instance of its links and a
synthetic timeline of contacts on each surrogate link, in the different ways
described below (see Supplementary Information and Methods section for more
details on their practical implementation).

Method 0. As discussed above, the first effect of missing data is to decrease
the average degree of the aggregate contact network, while keeping its density
constant. Hence, the simplest approach is to merely compensate for this
decrease. We therefore measure the density of the resampled contact network,
$\rho_s$, as well as the average aggregate duration of the contacts,
$\langle w \rangle_s$. We then add back the missing nodes and create surrogate
links between these nodes, and between these nodes and the nodes of the
resampled data set, at random, with the only constraint of keeping the overall
link density fixed to $\rho_s$. We then attribute to each surrogate link the
same weight $\langle w \rangle_s$ and create for each link a timeline of
randomly chosen contact events of equal length $\Delta t = 20$ s (the temporal
resolution of the data set) whose total duration gives back
$\langle w \rangle_s$.

Method W. The heterogeneity of aggregated contact durations is known to play a
role in the spreading patterns of model diseases. We therefore refine Method 0
by collecting in the resampled data the list $\{w\}$ of aggregate contact
durations, or weights (W). We build the surrogate links and surrogate
timelines of contacts on each link as in Method 0, except that each surrogate
link carries a weight extracted at random from $\{w\}$, instead of the average
$\langle w \rangle_s$.
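
The construction steps shared by Methods 0 and W can be sketched as follows.
This is a simplified illustration under stated assumptions, not the procedure
detailed in the Methods section: the function names, the representation of
links as node pairs, and the way contact events are placed on distinct random
time steps are all hypothetical choices made for the example.

```python
import itertools
import random


def add_surrogate_links(kept_nodes, kept_links, missing_nodes, rng):
    """Draw random links involving missing nodes so that the density of the
    reconstructed network matches the density rho_s of the resampled one."""
    n_s = len(kept_nodes)
    rho_s = len(kept_links) / (n_s * (n_s - 1) / 2)
    missing = set(missing_nodes)
    all_nodes = sorted(kept_nodes) + sorted(missing)
    n = len(all_nodes)
    n_new = round(rho_s * n * (n - 1) / 2) - len(kept_links)
    # Only pairs with at least one missing node are candidates.
    candidates = [(a, b) for a, b in itertools.combinations(all_nodes, 2)
                  if a in missing or b in missing]
    return rng.sample(candidates, n_new)


def assign_weights(surrogate_links, observed_weights, rng, heterogeneous):
    """Method W draws each weight from the observed list; Method 0 uses the mean."""
    if heterogeneous:
        return {link: rng.choice(observed_weights) for link in surrogate_links}
    mean_w = sum(observed_weights) / len(observed_weights)
    return {link: mean_w for link in surrogate_links}


def random_timeline(weight, t_start, t_end, rng, dt=20):
    """Place about weight/dt contact events of length dt on distinct random steps."""
    n_events = min(round(weight / dt), (t_end - t_start) // dt)
    slots = range(t_start, t_end, dt)
    return sorted(rng.sample(slots, n_events))
```

The only difference between the two methods in this sketch is the
`heterogeneous` flag, which switches between a single average weight and
weights drawn from the measured list.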

Method WS. The fact that the population is divided into groups of individuals
such as classes or departments can have a strong impact on the structure of
the contact network [20, 23] and on spreading processes [50]. We thus measure
here the link density contact matrix of the resampled data, and construct
surrogate links so as to keep this matrix fixed (equal to the value measured
in the resampled data), in the spirit of stochastic block models with fixed
numbers of edges between blocks. Moreover, we collect in the resampled data
two separate lists of aggregate contact durations: $\{w\}_{int}$ gathers the
weights of links between individuals belonging to the same group, and
$\{w\}_{ext}$ is built with the weights of links joining individuals of
different groups. For each surrogate link, its weight is extracted at random
either from $\{w\}_{int}$ if it joins individuals of the same group or from
$\{w\}_{ext}$ if it associates individuals of different groups. Timelines are
then attributed to links as in W. This method assumes that the number of
missing nodes in each group is known, and preserves the group structure (S) of
the population.
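
A minimal sketch of the bookkeeping behind Method WS is given below: it
measures the link density contact matrix on a set of links and deduces how
many surrogate links each block needs so that the measured densities are kept
once the missing nodes are added back. The function names and the dictionary
representation of groups are assumptions made for the example; the actual
placement of links within each block is not shown.

```python
from collections import Counter


def max_links(sizes, a, b):
    """Maximum possible number of links between groups a and b."""
    return sizes[a] * (sizes[a] - 1) // 2 if a == b else sizes[a] * sizes[b]


def link_density_matrix(group_of, links):
    """Return {(a, b): rho_ab} for every pair of groups with a <= b."""
    sizes = Counter(group_of.values())
    counts = Counter(tuple(sorted((group_of[i], group_of[j]))) for i, j in links)
    groups = sorted(sizes)
    rho = {}
    for k, a in enumerate(groups):
        for b in groups[k:]:
            m = max_links(sizes, a, b)
            rho[(a, b)] = counts[(a, b)] / m if m else 0.0
    return rho


def surrogate_links_per_block(rho_s, sizes_full, counts_resampled):
    """Number of surrogate links to add in each block so that the
    reconstructed network keeps the measured block densities rho_s."""
    return {
        (a, b): round(rho * max_links(sizes_full, a, b))
        - counts_resampled.get((a, b), 0)
        for (a, b), rho in rho_s.items()
    }
```

Note that `sizes_full` requires the number of individuals in each group of the
whole population, which is exactly the extra knowledge this method assumes.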

Method WT. Several works have investigated how the temporal characteristics of
networks (such as burstiness) can slow down or accelerate spreading
[25, 29, 52]. In order to take these characteristics into account, we measure
in the resampled data the distributions of the number of contacts per link and
of contact and inter-contact durations, in addition to the global network
density. We build surrogate links as in Method W, and construct on each link a
synthetic timeline so as to respect the measured temporal statistics (T) of
contacts. More precisely, we attribute at random a number of contacts (taken
from the measured distribution) to each surrogate link, and then alternate
contact and inter-contact durations taken at random from the respective
empirical distributions.
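
The alternation of contact and inter-contact durations can be sketched as
below. The empirical distributions are represented simply as lists of observed
values from which entries are drawn at random. The function name, the random
initial offset and the truncation at the end of the observation window are
assumptions of this sketch rather than details taken from the Methods section.

```python
import random


def surrogate_timeline(n_contacts, contact_durations, gap_durations,
                       t_start, t_end, rng):
    """Build one synthetic timeline as a list of (start, end) contact intervals.

    n_contacts, contact_durations, gap_durations: lists of values measured in
    the resampled data (numbers of contacts per link, contact durations and
    inter-contact durations, respectively).
    """
    n = rng.choice(n_contacts)
    t = t_start + rng.choice(gap_durations)  # assumed random initial offset
    events = []
    for _ in range(n):
        d = rng.choice(contact_durations)
        if t + d > t_end:  # assumed truncation at the end of the data
            break
        events.append((t, t + d))
        t += d + rng.choice(gap_durations)
    return events
```

Because durations are drawn independently, this construction reproduces the
measured distributions but no correlation between successive contacts.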

Method WST. This method conserves the distribution 
of link weights (W), the group structure (S), and the 
temporal characteristics of contacts (T): surrogate links 
are built and weights assigned as in method WS, and 
contact timelines on each link as in method WT. 

Each of these methods uses a different amount of information gathered from the
resampled data. Methods 0, W and WT include an increasing amount of detail on
the temporal structure of contacts: method 0 assumes homogeneity of aggregated
contact durations, while W takes into account their heterogeneity, and WT
reproduces heterogeneities of contact and inter-contact durations. On the
other hand, none of these three methods assumes any knowledge of the
population group structure. This can be due either to an effective complete
lack of knowledge about the population structure, as in the SFHH data, or to
the lack of data on the distribution of the missing nodes among the groups.
Methods WS and WST on the other hand reproduce the group structure as in a
stochastic block model with fixed number of links within and between groups,
and take into account the difference between the distributions of numbers of
contacts and aggregate durations between individuals of the same or of
different groups. Indeed, links within groups correspond on average to larger
weights, as found empirically in [50] and discussed above (supplementary
figures 5-6). Overall, method WST is the one that uses the most information
measured in the resampled data. (Additional properties such as the
transitivity, which is also stable under the resampling procedure (see
supplementary figure 3), can also be measured in the resampled data and
imposed in the construction of surrogate links, as detailed in the
Supplementary Information. This comes however at a strong computational cost
and we have verified that it does not significantly impact our results, as
shown in supplementary figure 20.)

We check in Table II and supplementary figures 8-13 that the statistical
properties of the resulting reconstructed (surrogate) networks, obtained by
the union of the resampled data and of the surrogate links, are similar to
those of the original data for the WST method. We emphasise again that our aim
is not to infer the true missing contacts, so that we do not compare the
detailed structures of the surrogate and original contact networks.

Figures 2, 3 and 4 and supplementary figures 16-19 display the outcome of SIR
spreading simulations performed on surrogate networks obtained using the
various reconstruction methods, compared with the outcome of simulations on
the resampled data sets, for various values of $f$. Method 0 leads to a clear
overestimation of the outcome and does not capture well the shape of the
distribution of outbreak sizes. Method W gives only slightly better results.
The overall shape of the distribution is better captured for the three
reconstruction methods using more information: WS, WT and WST (note that for
the SFHH case the population is not structured, so that W and WS are
equivalent, as are WT and WST). The WST method best matches the shape of the
distributions and yields distributions much more similar to those obtained by
simulating on the whole data set than the simulations performed on the
resampled networks. We also show in Fig. 5 the fraction of outbreaks that
reach at least 20% of the population and the average epidemic size for these
outbreaks. In the case of simulations performed on resampled data, we rapidly
lose information about the size and even the existence of large outbreaks as
$f$ increases. Simulations using data reconstructed with methods 0 and W, on
the contrary, largely overestimate these quantities, which is expected as
infections spread more easily on random graphs than on structured graphs,
especially if the heterogeneity of the aggregated contact durations is not
considered [20, 22]. Taking into account the population structure or using
contact sequences that respect the temporal heterogeneities (broad
distributions of contact and inter-contact durations) yields better results
(WS and WT cases, respectively). Overall, the WST method, for which the
surrogate networks respect all these constraints, yields the best results.

We show in the Supplementary Information that similar results are obtained for
different values of the spreading parameters. Moreover, as shown in Fig. 6 and
supplementary figures 14-15, the phase diagram obtained for the SIS model when
using reconstructed networks is much closer to the original than for resampled
networks. Overall, simulations on networks reconstructed using the WST method
yield a much better estimation of the epidemic risk than simulations using
resampled network data, for both SIS and SIR models.


Reshuffled data sets. 

Even when simulations are performed on reconstructed contact patterns built
with the WST method, the maximal outbreak sizes are systematically
overestimated (Figs. 2-4), as well as, in most cases, the probability and
average size of large outbreaks, especially for the SFHH case (Figs. 4-5).
These discrepancies might stem from structural and/or temporal correlations
present in empirical contact data that are not taken into account in our
reconstruction methods. In order to test this hypothesis, we construct several
reshuffled data sets and use them as initial data in our resampling and
reconstruction procedure. We use both structural and temporal reshuffling as
described in the Methods section, in order to remove either structural
correlations, temporal correlations, or both, from the original data sets. We
then proceed to a resampling and reconstruction procedure (using the WST
method) as for the original data, and perform numerical simulations of SIR
processes. As for the original data, simulations on resampled data lead to a
strong underestimation of the process outcome, and simulations using the
reconstructed data give much better results.

We show in supplementary figures 21-22 that we still obtain discrepancies, and
in particular overestimations of the largest epidemic sizes, when we use
temporally reshuffled data in which the link structure of the contact network
is maintained. If on the other hand we use data in which the network structure
has been reshuffled so as to cancel structural correlations within each group,
the reconstruction procedure gives a very good agreement between the
distributions of epidemic sizes of original and reconstructed data, as shown
in Fig. 7. More precisely, we consider here "CM-shuffled" data, i.e., contact
networks in which the links have been reshuffled randomly but separately for
each pair of groups: a link between an individual of group $A$ and an
individual of group $B$ is still between groups $A$ and $B$ in the reshuffled
network. The difference with the case of non-reshuffled empirical data is
particularly clear for the SFHH case. This indicates that the overestimation
observed in Figs. 2-4 is mostly due to the fact that the reconstructed data do
not reproduce small scale structures of the contact networks: such structures
might be due to e.g. groups of colleagues or friends, whose composition is
neither available as metadata nor detectable in the resampled data sets.


Limitations. 

When the fraction $f$ of nodes excluded by the resampling procedure becomes
large, the properties of the resampled data may start to differ substantially
from those of the whole data set (supplementary figures 1 and 2). As a result,
the distributions of epidemic sizes of SIR simulations show stronger
deviations from those obtained on the whole data set (Fig. 8), even if the
epidemic risk evaluation is still better than for simulations on the resampled
networks (Fig. 5). Most importantly however, the information remaining in the
resampled data at large $f$ can be insufficient to construct surrogate
contacts. This happens in particular if an entire class or department is
absent from the resampled data or if all the resampled nodes of a
class/department are disconnected (see Methods for details). We show in the
bottom plots of Fig. 5 the failure rate, i.e., the fraction of cases in which
we are not able to construct surrogate networks from the resampled data. The
failure rate increases gradually with $f$ for the InVS data since the groups
(departments) are of different sizes. For the Thiers13 data, all classes are
of similar sizes so that the failure rate abruptly reaches a large value at a
given value of $f$. For the SFHH data, we can always construct surrogate
networks as the population is not structured. Another limitation of the
reconstruction method lies in the need to know the number of individuals
missing in each department or class. If these numbers are completely unknown,
giving an estimation of outbreak sizes is impossible, as adding arbitrary
numbers of nodes and links to the resampled data can lead to arbitrarily large
epidemics. The methods are however still usable if only partial information is
available. For instance, if only the overall missing number of individuals is
available, it is possible to use the WT method, which still gives sensible
results. Moreover, if $f$ is only approximately known, e.g., $f$ is known to
be within an interval of possible values $[f_1, f_2]$, it is possible to
perform two reconstructions using the respective hypotheses $f = f_1$ and
$f = f_2$ and to give an interval of estimates. We provide an example of such
a procedure in supplementary figure 23.
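
When only an interval of values for the missing fraction is known, the number
of nodes to add back follows from the number of sampled nodes: if $N_s$ nodes
are observed and a fraction $f$ of the population is missing, then
$N_s = (1-f)N$ and the number of missing nodes is $f N_s/(1-f)$. The short
sketch below applies this relation to both ends of the interval; the function
names are hypothetical and the reconstruction itself is not shown.

```python
def missing_node_count(n_sampled, f):
    """Number of nodes to add back if a fraction f of the population is missing."""
    if not 0 <= f < 1:
        raise ValueError("f must satisfy 0 <= f < 1")
    return round(n_sampled * f / (1 - f))


def missing_count_interval(n_sampled, f1, f2):
    """Bounds on the number of missing nodes when f lies in [f1, f2]."""
    lo, hi = sorted((f1, f2))
    return missing_node_count(n_sampled, lo), missing_node_count(n_sampled, hi)
```

Running one reconstruction for each bound then brackets the estimated
outcome of the simulations.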


DISCUSSION 

The understanding of epidemic spreading phenomena has been vastly improved
thanks to the use of data-driven models at different scales. High resolution
contact data in particular have been used to evaluate epidemic risk or
containment policies in specific populations or to perform contact tracing. In
such studies, missing data due to population sampling might however represent
a serious issue: individuals absent from a data set are indeed equivalent to
immunised individuals when epidemic processes are simulated. Feeding sampled
data into data-driven models can therefore lead to severe underestimations of
the epidemic risk and might even a priori affect the evaluation of mitigation
strategies if for instance some at-risk groups are particularly undersampled.

Here we have put forward a set of methods to obtain a better evaluation of the
outcome of spreading simulations for data-driven models using contact data
from a uniformly sampled population. To this aim, we have shown how it is
possible, starting from a data set describing the contacts of only a fraction
of the population of interest (uniformly sampled from the whole population),
to construct surrogate data sets using various amounts of accessible
information, i.e., quantities measured in the sampled data. We have shown that
the simplest method, which consists in simply compensating for the decrease in
the average number of neighbours due to sampling, yields a strong
overestimation of the epidemic risk. When additional information describing
the group structure and the temporal properties of the data is added in the
construction of surrogate data sets, simulations of epidemic spreading on such
surrogate data yield results similar to those obtained on the complete data
set. (We note that the issue of how much information should be included when
constructing the surrogate data is linked to the general issue of how much
information is needed to get an accurate picture of spreading processes on
temporal networks.) Some discrepancies in the epidemic risk estimation are
however still observed, due in particular to small scale structural
correlations of the contact network that are difficult or even impossible to
measure in the resampled data: these discrepancies are indeed largely
suppressed if we use as original data a reshuffled contact network in which
such correlations are absent.

The methods presented here yield much better results than simulations using
resampled data, even when a substantial part of the population is excluded, in
particular in estimating the probability of large outbreaks. They suffer
however from limitations, especially when the fraction $f$ of excluded
individuals is too large. First, the construction of the surrogate contacts
relies on the stability of a set of quantities with respect to resampling, but
the measured quantities start to deviate from the original ones at large $f$.
The shape of the distribution of epidemic sizes may then differ substantially
from the original one. Second, large values of $f$ might even render the
construction of the surrogate data impossible due to the loss of information
on whole categories of nodes. Finally, at least an estimate of the number of
missing individuals in the population is needed in order to create a surrogate
data set.

An interesting avenue for future work concerns possible
improvements of the reconstruction methods, in particular
by integrating into the surrogate data additional information
and complex correlation patterns measured in
the sampled data. For instance, the number of contacts
varies significantly with the time of day in most contexts:
the corresponding activity timeline might be measured in
the sampled data (overall or even for each group of individuals),
assumed to be robust to sampling and used
in the reconstruction of contact timelines. More systematically,
it might also be possible to use the temporal
network decomposition technique put forward in [55] on
the sampled data, in order to extract mesostructures such
as temporally-localized mixing patterns. The surrogate
contacts could then be built in a way to preserve such
patterns. Indeed, correlations between structure and activity
in the temporal contact network are known to influence
spreading processes but are
notoriously difficult to measure. If the group structure
of the population is unknown, recent approaches based
on stochastic block models [59] might be used to extract
groups from the resampled data; this extracted group
structure could then be used to build the corresponding
contact matrix and surrogate data sets.

We finally recall that we have assumed a uniform sampling
of nodes, corresponding to an independent random
choice of each individual of the population to take part
or not in the data collection. Other types of sampling
or data losses can however also be present in data collected
by wearable sensors, such as partial coverage of
the premises of interest by the measuring infrastructure,
non-uniform sampling depending on individual activity
(very busy persons or, on the contrary, asocial individuals
might not want to wear sensors), on group membership,
or due to clusters of non-participating individuals (e.g.,
groups of friends). In addition, other types of data sets
such as the ones obtained from surveys or diaries correspond
to different types of sampling, as each respondent
then provides information in the form of an ego-network.
Such data sets potentially involve additional types
of biases such as underreporting of the number of contacts
and overestimation of contact durations;
how to adapt the methods presented here to such data is an important
issue that we will examine in future work. Finally,
the population under study is (usually) not isolated from
the external world, and it would be important to devise
ways to include contacts with outsiders in the data and
simulations, for instance by using other data sources such
as surveys.
METHODS

Data 

We consider data sets collected using the SocioPatterns
proximity sensing platform (http://www.sociopatterns.org),
based on wearable sensors that
detect close face-to-face proximity of individuals wearing
them. Informed consent was obtained from
all participants and the French national body responsible
for ethics and privacy, the Commission
Nationale de l'Informatique et des Libertés (CNIL,
http://www.cnil.fr), approved the data collections.

The high school (Thiers13) data set [61] is structured in
9 classes, forming three subgroups of three classes corresponding
to their specialisation in Mathematics-Physics
(MP, MP*1, MP*2 with respectively 31, 29 and 38 students),
Physics (PC, PC*, PSI with respectively 44, 39
and 34 students), or Biology (2BIO1, 2BIO2, 2BIO3 with
respectively 37, 35 and 39 students).

The workplace (InVS) data set [46] is structured in
5 departments: DISQ (Scientific Direction, 15 persons),
DMCT (Department of Chronic Diseases and Traumatisms,
26 persons), DSE (Department of Health and Environment,
34 persons), SRH (Human Resources, 13 persons)
and SFLE (Logistics, 4 persons).

For the conference data (SFHH), we do not have metadata
on the participants, and the aggregated network
structure was found to be homogeneous [22].

SIR and SIS simulations 

Simulations of SIR and SIS processes on the temporal
networks of contacts (original, resampled or reconstructed)
are performed using the temporal Gillespie algorithm
described in [62]. For each run of the simulations,
all nodes are initially susceptible; a node is chosen
at random as the seed of the epidemic and put in the
infectious state at a point in time chosen at random over
the duration of the contact data. A susceptible node
in contact with an infectious node becomes infectious at
rate $\beta$. Infectious nodes recover at rate $\mu$: in the SIR
model they then enter the recovered state and cannot
become infectious again, while in the SIS model they enter
the susceptible state again. If needed, the sequence
of contacts is repeated in the simulation [22].
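
As a rough illustration of the SIR dynamics described above, the following Python sketch uses a simple discrete-time approximation rather than the temporal Gillespie algorithm of [62]. The function name `sir_final_size`, the `(step, i, j)` contact format, the time-step length `dt` and the cap `max_repeats` are assumptions made for this example only.

```python
import math
import random
from collections import defaultdict

def sir_final_size(contacts, nodes, beta, mu, dt, rng, max_repeats=1000):
    """One discrete-time SIR run on a temporal contact list (illustrative only).

    contacts: list of (step, i, j), with `step` an integer time-step index
              starting at 0 (hypothetical input format).
    Returns the fraction of nodes that are no longer susceptible at the end.
    """
    by_step = defaultdict(list)
    for step, i, j in contacts:
        by_step[step].append((i, j))
    n_steps = max(by_step) + 1

    # Probability that an event of rate beta (or mu) occurs within one step.
    p_inf = 1.0 - math.exp(-beta * dt)
    p_rec = 1.0 - math.exp(-mu * dt)

    state = {v: "S" for v in nodes}
    seed = rng.choice(list(nodes))      # random seed node
    state[seed] = "I"
    infectious = {seed}
    t = rng.randrange(n_steps)          # random starting time
    steps_left = max_repeats * n_steps  # safety cap (assumption)

    while infectious and steps_left > 0:
        newly = set()
        # The contact sequence is repeated periodically if needed.
        for i, j in by_step.get(t % n_steps, ()):
            for a, b in ((i, j), (j, i)):
                if state[a] == "I" and state[b] == "S" and rng.random() < p_inf:
                    newly.add(b)
        recovered = {v for v in infectious if rng.random() < p_rec}
        for v in recovered:
            state[v] = "R"
        for v in newly:
            state[v] = "I"
        infectious = (infectious - recovered) | newly
        t += 1
        steps_left -= 1

    return sum(s != "S" for s in state.values()) / len(state)
```

Repeating such runs many times and histogramming the returned values gives an approximate distribution of epidemic sizes; an event-driven scheme such as the one cited in the text avoids the time-discretisation error of this sketch.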

For SIR processes, we run each simulation, with the
seed node chosen at random, until no infectious individual
remains (nodes are thus either still susceptible or have
been infected and then recovered). We consider values of
$\beta$ and $\mu$ yielding a non-negligible epidemic risk, i.e., such
that a rather large fraction of simulations lead to a final
size larger than 20% of the population (see Figs. 1 to 4):
$\beta = 4 \times 10^{-4}\,\mathrm{s}^{-1}$, $\mu = 4 \times 10^{-7}\,\mathrm{s}^{-1}$ (InVS) or $4 \times 10^{-6}\,\mathrm{s}^{-1}$
(SFHH and Thiers13). Other parameter values are explored
in the Supplementary Information. For each set of
parameters, the distribution of epidemic sizes is obtained
by performing 1,000 simulations.

For SIS processes, simulations are performed using the
quasi-stationary approach of [63]. They are run until the
system enters a stationary state as witnessed by the mean
number of infected nodes being constant over time. Simulations
are then continued for 50,000 time-steps while
recording the number of infected nodes. For each set
of parameters, the simulations are performed once with
each node of the network chosen as the seed node.


Reconstruction algorithm 

We consider a population $\mathcal{P}$ of $N$ individuals, potentially
organised in groups. We assume that all the contacts
occurring among a subpopulation $\tilde{\mathcal{P}}$ of these individuals,
of size $\tilde{N} = (1 - f)N$, are known. This constitutes
our resampled data, from which we need to construct
a surrogate set of contacts concerning the remaining
$n = N - \tilde{N} = fN$ individuals for which no contact
information is available: these contacts can occur among
these individuals and between them and the members of
$\tilde{\mathcal{P}}$. We assume that we know the group of each member
of $\mathcal{P} \setminus \tilde{\mathcal{P}}$, and the overall activity timeline, i.e., the
intervals during which contacts take place, separated by
nights and weekends.

To construct the surrogate data (WST method), we
first compute from the activity timeline the total duration
$T_u$ of the periods during which contacts can occur.

Then, we measure in the sampled data: 

• the density $\rho$ of links in the aggregated contact network;

• a row-normalised contact matrix $C$, in which the
element $C_{AB}$ gives the probability for a node of
group $A$ to have a link to a node of group $B$;

• the list $\{\tau_c\}$ of contact durations;

• the lists $\{\tau_{ic}\}_{\mathrm{int}}$ and $\{\tau_{ic}\}_{\mathrm{ext}}$ of inter-contact durations
for internal and external links, i.e., for links
between nodes of the same group and links between
nodes that belong to two different groups, respectively;

• the lists $\{p\}_{\mathrm{int}}$ and $\{p\}_{\mathrm{ext}}$ of numbers of contacts
per link, respectively for internal (within groups)
and external (between groups) links;

• the list $\{t_0\}$ of initial times between the start of the
data set and the first contact between two nodes.
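
The first two measurements in the list above (link density and row-normalised contact matrix) could be computed along the following lines. This is only a sketch: the function name, the input format (a list of distinct undirected links plus a node-to-group dictionary) and the choice of counting link ends when building the matrix rows are assumptions, not details given in the text.

```python
from collections import defaultdict

def density_and_contact_matrix(links, group_of):
    """Measure link density and a row-normalised group contact matrix.

    links:    list of distinct undirected links (i, j) of the aggregated network.
    group_of: dict mapping every sampled node to its group label.
    Returns (rho, C), where C[A][B] is the fraction of link ends attached to
    group A whose other end lies in group B (assumed definition).
    """
    n = len(group_of)
    rho = len(links) / (n * (n - 1) / 2)  # fraction of node pairs that are linked

    counts = defaultdict(lambda: defaultdict(float))
    for i, j in links:
        a, b = group_of[i], group_of[j]
        counts[a][b] += 1  # link end seen from group a
        counts[b][a] += 1  # link end seen from group b

    C = {}
    for a, row in counts.items():
        total = sum(row.values())
        C[a] = {b: c / total for b, c in row.items()}
    return rho, C
```

A group whose sampled members have no links at all simply does not appear in `C`, which is one of the situations in which a reconstruction cannot proceed.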


Given $\rho$, we compute the number $e$ of additional links
needed to keep the network density constant when we
add the $n$ excluded nodes.
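
A minimal sketch of this computation follows, assuming that the density is defined as the fraction of node pairs that are linked and that the result is rounded to the nearest integer; both the function name and these conventions are assumptions for illustration.

```python
def additional_links(rho, n_total, n_sampled):
    """Number of links to add so that the full network keeps density rho.

    With density defined as links / (number of node pairs), the sampled
    network holds rho * n_sampled*(n_sampled-1)/2 links and the full one
    should hold rho * n_total*(n_total-1)/2; the difference is returned.
    """
    pairs_total = n_total * (n_total - 1) // 2
    pairs_sampled = n_sampled * (n_sampled - 1) // 2
    return round(rho * (pairs_total - pairs_sampled))
```

For example, with a density of 0.5, 10 individuals in total and 6 sampled, the 30 node pairs that involve at least one excluded node call for 15 additional links.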

We then construct each link according to the following 
procedure: 

(a) a node $i$ is randomly chosen from the set $\mathcal{P} \setminus \tilde{\mathcal{P}}$ of
excluded nodes;

(b) knowing the group $A$ that $i$ belongs to, we extract
at random a target group $B$ with probability given
by $C_{AB}$;

(c) we draw a target node $j$ at random from $B$ (if $B = A$,
we take care that $i \neq j$) such that $i$ and $j$ are
not linked;

(d) depending on whether $i$ and $j$ belong to the same
group or not, we draw from $\{p\}_{\mathrm{int}}$ or $\{p\}_{\mathrm{ext}}$ the
number of contact events $p$ taking place over the
link $ij$;

(e) from $\{t_0\}$, we draw the initial waiting time $t_0$ before
the first contact;

(f) from $\{\tau_c\}$, we draw $p$ contact durations $\tau_c^k$, $k = 1, \dots, p$;

(g) from $\{\tau_{ic}\}_{\mathrm{int}}$ or $\{\tau_{ic}\}_{\mathrm{ext}}$, we draw $p - 1$ inter-contact
durations $\tau_{ic}^m$, $m = 1, \dots, p - 1$;

(h) if $t_0 + \sum_k \tau_c^k + \sum_m \tau_{ic}^m > T_u$, we repeat steps (d)
to (g) until we obtain a set of values such that
$t_0 + \sum_k \tau_c^k + \sum_m \tau_{ic}^m \le T_u$;

(i) from $t_0$ and the $\tau_c^k$ and $\tau_{ic}^m$, we build the contact
timeline of the link $ij$;

(j) finally, we insert in the contact timeline the breaks
defined by the global activity timeline.
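
The drawing of the number of contacts, the initial waiting time, the contact durations and the inter-contact durations, repeated until the timeline fits within the available time, is a form of rejection sampling. The sketch below illustrates it for a single link; the function name, the argument names and the `max_tries` cap are hypothetical, and the final insertion of breaks from the global activity timeline is not included.

```python
import random

def draw_link_timeline(p_list, t0_list, tau_c_list, tau_ic_list, T_u, rng,
                       max_tries=10_000):
    """Draw a contact timeline for one link from empirical lists.

    Returns a list of (start, end) contact intervals measured in "active"
    time, i.e., before nights and weekends are re-inserted.
    """
    for _ in range(max_tries):
        p = rng.choice(p_list)                                  # number of contacts
        t0 = rng.choice(t0_list)                                # initial waiting time
        durations = [rng.choice(tau_c_list) for _ in range(p)]  # contact durations
        gaps = [rng.choice(tau_ic_list) for _ in range(p - 1)]  # inter-contact durations
        if t0 + sum(durations) + sum(gaps) <= T_u:
            break  # the timeline fits in the available time
    else:
        # max_tries is an assumption of this sketch, not part of the text.
        raise RuntimeError("no admissible timeline found")

    timeline, t = [], t0
    for k, d in enumerate(durations):
        timeline.append((t, t + d))
        t += d
        if k < p - 1:
            t += gaps[k]
    return timeline
```

In a full reconstruction the lists for internal or external links would be passed depending on whether the two end nodes belong to the same group.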


Possible failure of the reconstruction method at
large $f$

The construction of the surrogate version of the missing
links uses as an input the group structure of the
subgraph that remains after sampling, as given by the
contact matrix of the link densities between the different
groups of nodes that are present in the subpopulation $\tilde{\mathcal{P}}$.
Depending on the characteristics of $\tilde{\mathcal{P}}$ and of the corresponding
contacts, the construction method can fail in
several cases: (i) if an entire group (class/department)
of nodes in the population is absent from $\tilde{\mathcal{P}}$; (ii) if the
remaining nodes of a specific group (class/department)
are all isolated in the contact network of $\tilde{\mathcal{P}}$; (iii) if, during the
algorithm, a node of $\mathcal{P} \setminus \tilde{\mathcal{P}}$ is selected in a certain group $A$
but cannot create any more links because it already has
links to all nodes in the groups $B$ such that $C_{AB} \neq 0$;
(iv) if there are either no internal (within groups) or no external
(between groups) links in the contact network of
$\tilde{\mathcal{P}}$: in this case one of the lists of link temporal characteristics
is empty and the corresponding structures cannot
be reconstructed.

Cases (i) and (ii) correspond to a complete loss
of information about the connectivity of a group
(class/department) of the population, due to sampling.
It is then impossible to reconstruct a sensible connectivity
pattern for these nodes. Case (iii) is more subtle
and occurs in situations of very low connectivity between
groups. For instance, within the contact network of $\tilde{\mathcal{P}}$,
a group $A$ has links only with another specific group $B$,
and both $A$ and $B$ are small; it is then possible that
the nodes of $(\mathcal{P} \setminus \tilde{\mathcal{P}}) \cap A$ exhaust the set of possible links
to nodes of $B$ during the reconstruction algorithm. If a
node of $(\mathcal{P} \setminus \tilde{\mathcal{P}}) \cap A$ is again chosen to create a link, such a
creation is not possible and the construction fails. Case
(iv) usually corresponds to situations in which the links
between individuals of different groups which remain in
the resampled data set correspond to pairs of individuals
who have had only one contact event: in such cases,
$\{\tau_{ic}\}_{\mathrm{ext}}$ is empty and external links with more than one
contact cannot be built.


Shufflings 

In order to test the effect of correlations in the temporal
network, we use four shuffling methods, based on the
ones defined in [56].

Link shuffling. The contact timelines associated with 
each link are randomly redistributed among the links. 
Correlations between timelines of links adjacent to a 
given node are destroyed, as well as correlations between 
weights and topology. The structure of the network is 
kept, as well as the global activity timeline. 
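
A link shuffling of this kind can be sketched in a few lines of Python, assuming the temporal network is stored as a dictionary mapping each link to its contact timeline (a hypothetical representation chosen for this example).

```python
import random

def link_shuffle(timelines, rng):
    """Randomly redistribute contact timelines among the existing links.

    timelines: dict mapping a link (i, j) to its list of (start, end) contacts.
    The set of links (network structure) is unchanged; only the assignment
    of timelines to links is permuted.
    """
    links = list(timelines)
    shuffled = [timelines[link] for link in links]
    rng.shuffle(shuffled)
    return dict(zip(links, shuffled))
```

Since whole timelines are moved, the set of contact events, and hence the global activity timeline, is preserved.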

Time shuffling. From the contact data we build the
lists $\{\tau_c\}$, $\{\tau_{ic}\}$ and $\{p\}$ of, respectively, contact durations,
inter-contact durations and number of contacts
per link. We also measure the list $\{t_0\}$ of initial times
between the start of the data set and the first contact
between two nodes. For each link, we draw randomly a
starting time $t_0$, a number $p$ of contacts from $\{p\}$, $p$ contact
durations from $\{\tau_c\}$ and $p - 1$ inter-contact durations
from $\{\tau_{ic}\}$, so that the total duration of the timeline does
not exceed the total available time $T_u$. We then construct
the contact timelines, thus destroying the temporal correlations
among contacts. The structure of the network
is instead kept fixed.

CM shuffling. We perform a link rewiring separately
on each compartment of the contact matrix, i.e., we
randomly redistribute links with their contact timelines
within each group, and within each pair of groups. We
thus destroy the structural correlations inside each compartment
of the contact matrix, while preserving the
group structure of the network as given by the link density
contact matrix and the contact matrix of total contact
times between groups.

CM-time shuffling. We perform both a CM shuffling
and a time shuffling.


[1] Barrat, A., Barthélemy, M. & Vespignani, A. Dynamical
processes on complex networks (Cambridge University
Press (Cambridge), 2008).

[2] Read, J. M., Edmunds, W. J., Riley, S., Lessler, J. &
Cummings, D. A. T. Close encounters of the infectious
kind: methods to measure social mixing behaviour.
Epidemiology & Infection 140, 2117-2130 (2012).

[3] Edmunds, W. J., O’Callaghan, C. J. & Nokes, D. J. Who
mixes with whom? A method to determine the contact
patterns of adults that may lead to the spread of airborne
infections. Proceedings of the Royal Society of London.
Series B: Biological Sciences 264, 949-957 (1997).

[4] Read, J., Eames, K. & Edmunds, W. Dynamic social
networks and the implications for the spread of infectious
disease. J R Soc Interface 5, 1001-7 (2008).

[5] Mossong, J. et al. Social contacts and mixing patterns
relevant to the spread of infectious diseases. PLoS Med
5, e74 (2008).

[6] Danon, L., House, T., Read, J. & Keeling, M. Social
encounter networks: collective properties and disease
transmission. J. R. Soc. Interface 9, 2826-2833 (2012).

[7] Danon, L., Read, J. M., House, T. A., Vernon, M. C.
& Keeling, M. J. Social encounter networks: characterizing
Great Britain. Proceedings of the Royal Society B:
Biological Sciences 280, 1765 (2013).

[8] Smieszek, T., Burri, E. U., Scherzinger, R. & Scholz,
R. W. Collecting close-contact social mixing data with
contact diaries: reporting errors and biases.
Epidemiology & Infection 140, 744-752 (2012).

[9] Smieszek, T. et al. How should social mixing be measured:
comparing web-based survey and sensor-based
methods. BMC Infectious Diseases 14, 136 (2014).

[10] Hui, P. et al. Pocket switched networks and human mobility
in conference environments. In WDTN ’05: Proc.
2005 ACM SIGCOMM workshop on Delay-tolerant networking
(ACM, New York, NY, USA, 2005).

[11] O’Neill, E. et al. Instrumenting the city: Developing
methods for observing and understanding the digital
cityscape. In Ubicomp, vol. 4206, 315-332 (2006).

[12] Eagle, N., Pentland, A. S. & Lazer, D. Inferring friendship
network structure by using mobile phone data.
Proceedings of the National Academy of Sciences 106,
15274-15278 (2009).

[13] Vu, L., Nahrstedt, K., Retika, S. & Gupta, I. Joint
bluetooth/wifi scanning framework for characterizing and
leveraging people movement in university campus. In
Proceedings of the 13th ACM International Conference
on Modeling, Analysis, and Simulation of Wireless and
Mobile Systems, MSWIM ’10, 257-265 (ACM, New York,
NY, USA, 2010).

[14] Salathé, M. et al. A high-resolution human contact
network for infectious disease transmission. Proceedings
of the National Academy of Sciences 107, 22020-22025
(2010).

[15] Hashemian, M., Stanley, K. & Osgood, N. Flunet: Automated
tracking of contacts during flu season. In Proceedings
of the 6th International workshop on Wireless
Network Measurements, 557-562 (2010).


[16] Cattuto, C. et al. Dynamics of person-to-person interactions
from distributed RFID sensor networks. PLoS
ONE 5, e11596 (2010).

[17] Hornbeck, T. et al. Using sensor networks to study the
effect of peripatetic healthcare workers on the spread of
hospital-associated infections. Journal of Infectious Diseases
206, 1549-1557 (2012).

[18] Stopczynski, A. et al. Measuring large-scale social networks
with high resolution. PLoS ONE 9, e95978 (2014).

[19] Obadia, T. et al. Detailed contact data and the dissemination
of staphylococcus aureus in hospitals. PLoS
Comput Biol 11, e1004170 (2015).

[20] Toth, D. J. A. et al. The role of heterogeneity in contact
timing and duration in network models of influenza
spread in schools. Journal of The Royal Society Interface
12, 20150279 (2015).

[21] Isella, L. et al. What’s in a crowd? Analysis of face-to-face
behavioral networks. Journal of Theoretical Biology
271, 166-180 (2011).

[22] Stehlé, J. et al. Simulation of an SEIR infectious disease
model on the dynamic contact network of conference attendees.
BMC Medicine 9, 87 (2011).

[23] Stehlé, J. et al. High-resolution measurements of face-to-face
contact patterns in a primary school. PLoS ONE
6, e23176 (2011).

[24] Fournet, J. & Barrat, A. Contact patterns among high
school students. PLoS ONE 9, e107878 (2014).

[25] Holme, P. & Saramäki, J. Temporal networks. Physics
Reports 519, 97-125 (2012).

[26] Lee, S., Rocha, L. E. C., Liljeros, F. & Holme, P. Exploiting
temporal network structures of human interaction to
effectively immunize populations. PLoS ONE 7, e36439
(2012).

[27] Machens, A. et al. An infectious disease model on empirical
networks of human contact: bridging the gap between
dynamic network data and contact matrices. BMC Infectious
Diseases 13, 185 (2013).

[28] Smieszek, T. & Salathé, M. A low-cost method to assess
the epidemiological importance of individuals in controlling
infectious disease outbreaks. BMC Medicine 11,
35 (2013).

[29] Masuda, N. & Holme, P. Predicting and controlling
infectious disease epidemics using temporal networks.
F1000Prime Reports 5 (2013).

[30] Gemmetto, V., Barrat, A. & Cattuto, C. Mitigation of infectious
disease at school: targeted class closure vs school
closure. BMC Infectious Diseases 14, 695 (2014).

[31] Voirin, N. et al. Combining high-resolution contact data
with virological data to investigate influenza transmission
in a tertiary care hospital. Infection Control & Hospital
Epidemiology 36, 254-260 (2015).

[32] Chowell, G. & Viboud, C. A practical method to target
individuals for outbreak detection and control. BMC
Medicine 11, 36 (2013).

[33] Conlan, A. J. K. et al. Measuring social networks in
British primary schools through scientific engagement.
Proceedings of the Royal Society B: Biological Sciences
278, 1467-1475 (2011).





[34] Granovetter, M. Network sampling: Some first steps.
American Journal of Sociology 81, 1287-1303 (1976).

[35] Frank, O. Sampling and estimation in large social networks.
Social Networks 1, 91-101 (1979).

[36] Achlioptas, D., Clauset, A., Kempe, D. & Moore, C. On
the bias of traceroute sampling: Or, power-law degree
distributions in regular graphs. In Proceedings of the
Thirty-seventh Annual ACM Symposium on Theory of
Computing, STOC ’05, 694-703 (ACM, New York, NY,
USA, 2005).

[37] Kossinets, G. Effects of missing data in social networks.
Social Networks 28, 247-268 (2006).

[38] Ghani, A. C., Donnelly, C. A. & Garnett, G. P. Sampling
biases and missing data in explorations of sexual partner
networks for the spread of sexually transmitted diseases.
Statistics in Medicine 17, 2079-2097 (1998).

[39] Ghani, A. C. & Garnett, G. P. Measuring sexual partner
networks for transmission of sexually transmitted diseases.
Journal of the Royal Statistical Society: Series A
(Statistics in Society) 161, 227-238 (1998).

[40] Onnela, J.-P. & Christakis, N. A. Spreading paths in partially
observed social networks. Phys. Rev. E 85, 036106
(2012).

[41] Viger, F., Barrat, A., Dall’Asta, L., Zhang, C.-H. & Kolaczyk,
E. What is the real size of a sampled network?
The case of the Internet. Phys. Rev. E 75, 056111 (2007).

[42] Bliss, C. A., Danforth, C. M. & Dodds, P. S. Estimation
of global network statistics from incomplete data. PLoS
ONE 9, e108471 (2014).

[43] Zhang, Y., Kolaczyk, E. D. & Spencer, B. D. Estimating
network degree distributions under sampling: An inverse
problem, with applications to monitoring social media
networks. Ann. Appl. Stat. 9, 166-199 (2015).

[44] Cimini, G., Squartini, T., Gabrielli, A. & Garlaschelli,
D. Systemic risk analysis in reconstructed
economic and financial networks. Preprint at
http://arxiv.org/abs/1411.7613 (2014).

[45] Bobashev, G., Morris, R. J. & Goedecke, D. M. Sampling
for global epidemic models and the topology of an international
airport network. PLoS ONE 3, e3154 (2008).

[46] Génois, M. et al. Data on face-to-face contacts in an office
building suggest a low-cost vaccination strategy based on
community linkers. Network Science 3, 326-347 (2015).

[47] Cohen, R., Erez, K., ben Avraham, D. & Havlin, S. Resilience
of the Internet to random breakdowns. Phys.
Rev. Lett. 85, 4626-4628 (2000).

[48] Liben-Nowell, D. & Kleinberg, J. The link-prediction
problem for social networks. Journal of the American Society
for Information Science and Technology 58, 1019-1031
(2007).

[49] Smieszek, T., Fiebig, L. & Scholz, R. Models of epidemics:
when contact repetition and clustering should
be included. Theoretical Biology and Medical Modelling
6, 11 (2009).

[50] Onnela, J.-P. et al. Structure and tie strengths in mobile
communication networks. Proceedings of the National
Academy of Sciences 104, 7332-7336 (2007).

[51] Peixoto, T. P. Entropy of stochastic blockmodel ensembles.
Phys. Rev. E 85, 056122 (2012).

[52] Karsai, M. et al. Small but slow world: How network
topology and burstiness slow down spreading. Phys. Rev.
E 83, 025102 (2011).

[53] Blower, S. & Go, M.-H. The importance of including dynamic
social networks when modeling epidemics of airborne
infections: does increasing complexity increase accuracy?
BMC Medicine 9, 88 (2011).

[54] Pfitzner, R., Scholtes, I., Garas, A., Tessone, C. J. &
Schweitzer, F. Betweenness preference: Quantifying correlations
in the topological dynamics of temporal networks.
Physical Review Letters 110, 198701 (2013).

[55] Gauvin, L., Panisson, A. & Cattuto, C. Detecting the
community structure and activity patterns of temporal
networks: a non-negative tensor factorization approach.
PLoS ONE 9, e86028 (2014).

[56] Gauvin, L., Panisson, A., Cattuto, C. & Barrat, A. Activity
clocks: spreading dynamics on temporal networks
of human contact. Scientific Reports 3, 3099 (2013).

[57] Scholtes, I. et al. Causality-driven slow-down and speed-up
of diffusion in non-Markovian temporal networks. Nat.
Comm. 5, 5024 (2014).

[58] Gauvin, L., Panisson, A., Barrat, A. & Cattuto,
C. Revealing latent factors of temporal networks for
mesoscale intervention in epidemic spread. Preprint at
http://arxiv.org/abs/1501.02758 (2015).

[59] Peixoto, T. P. Inferring the mesoscale structure of layered,
edge-valued and time-varying networks. Preprint at
http://arxiv.org/abs/1504.02381 (2015).

[60] Robins, G., Pattison, P. & Woolcock, J. Missing data
in networks: exponential random graph (p*) models for
networks with non-respondents. Social Networks 26,
257-283 (2004).

[61] Mastrandrea, R., Fournet, J. & Barrat, A. Contact
patterns in a high school: a comparison between
data collected using wearable sensors, contact
diaries and friendship surveys. Preprint at
http://arxiv.org/abs/1506.03645 (2015).

[62] Vestergaard, C. L. & Génois, M. Temporal Gillespie
algorithm: Fast simulation of contagion
processes on time-varying networks. Preprint at
http://arxiv.org/abs/1504.01298v1 (2015).

[63] Ferreira, S. C., Ferreira, R. S. & Pastor-Satorras, R. Quasistationary
analysis of the contact process on annealed
scale-free networks. Phys Rev E Stat Nonlin Soft Matter
Phys 83, 066113 (2011).




TABLES & FIGURES 


Data set   Type          N     r     T         Dates
InVS       Workplace     92    63%   2 weeks   June 24th - July 5th 2013
Thiers13   High school   326   86%   1 week    December 2nd - 7th 2013
SFHH       Conference    403   34%   2 days    June 3rd - 4th 2009


TABLE I. Data sets. For each data set we specify the type of social situation, the number N of individuals whose contacts 
were measured, the corresponding participation rate r, the duration T and the dates of the data collection. 



Data            f     InVS CML               Thiers13 CML
Resampled       10%   0.996 [0.937, 0.999]   0.999 [0.998, 0.999]
Resampled       20%   0.980 [0.889, 0.994]   0.996 [0.995, 0.997]
Resampled       40%   0.925 [0.872, 0.983]   0.988 [0.983, 0.990]
Reconstructed   10%   0.976 [0.846, 0.995]   0.998 [0.994, 0.999]
Reconstructed   20%   0.942 [0.844, 0.984]   0.993 [0.985, 0.995]
Reconstructed   40%   0.890 [0.652, 0.953]   0.977 [0.938, 0.987]


TABLE II. Contact matrix similarities. Similarities between the original contact matrices and the contact matrices of the
resampled networks (top) and of the reconstructed networks (bottom). Median and 90% confidence interval for the cosine
similarity between link density contact matrices (CML) for different values of $f$, the fraction of nodes removed from the original
data. Values were obtained from 100 independent realisations of the resampling and reconstruction procedures.
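
For readers who wish to reproduce this kind of comparison, the cosine similarity between two contact matrices can be computed as in the sketch below. It assumes that each matrix is flattened into a vector of its entries before the comparison, which is a common convention but is not spelled out in the caption; the function name is likewise an assumption.

```python
import math

def cosine_similarity(m1, m2):
    """Cosine similarity between two matrices of identical shape.

    Each matrix (list of rows) is flattened into a vector; the result is
    the dot product divided by the product of the Euclidean norms.
    """
    v1 = [x for row in m1 for x in row]
    v2 = [x for row in m2 for x in row]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```

The measure is insensitive to an overall rescaling of either matrix, so it compares the pattern of link densities between groups rather than their absolute values.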




[Panels: InVS, Thiers13, SFHH]


FIG. 1. SIR epidemic simulations on resampled contact networks. We plot the distributions of epidemic sizes (fraction
of recovered individuals) at the end of SIR processes simulated on top of resampled contact networks, for different values of
the fraction $f$ of nodes removed. The plot shows the progressive disappearance of large epidemic outbreaks as $f$ increases. The
parameters of the SIR models are $\beta = 0.0004$ and $\beta/\mu = 1000$ (InVS) or $\beta/\mu = 100$ (Thiers13 and SFHH). The case $f = 0$
corresponds to simulations using the whole data set, i.e., the reference case. For each value of $f$, 1,000 independent simulations
were performed.
FIG. 2. SIR simulations for the InVS (workplace) case. We compare the outcome of SIR epidemic simulations
performed on resampled and reconstructed contact networks, for different methods of reconstruction. We plot the distribution
of epidemic sizes (fraction of recovered individuals) at the end of SIR processes simulated on top of resampled (sample) and
reconstructed contact networks, for different values of the fraction $f$ of nodes removed, and for the 5 reconstruction methods
described in the text (0, W, WS, WT, WST). The parameters of the SIR models are $\beta = 0.0004$ and $\beta/\mu = 1000$. The case
$f = 0$ corresponds to simulations using the whole data set, i.e., the reference case. For each value of $f$, 1,000 independent
simulations were performed.
FIG. 3. SIR simulations for the Thiers13 (high school) case. We compare the outcome of SIR epidemic simulations
performed on resampled (top left) and reconstructed contact networks, for different methods of reconstruction. We plot the
distribution of epidemic sizes (fraction of recovered individuals) at the end of SIR processes simulated on top of resampled
(sample) and reconstructed contact networks, for different values of the fraction $f$ of nodes removed, and for the 5 reconstruction
methods described in the text (0, W, WS, WT, WST). The parameters of the SIR models are $\beta = 0.0004$ and $\beta/\mu = 100$. The
case $f = 0$ corresponds to simulations using the whole data set, i.e., the reference case. For each value of $f$, 1,000 independent
simulations were performed.
FIG. 4. SIR simulations for the SFHH (conference) case. We compare the outcome of SIR epidemic simulations
performed on resampled and reconstructed contact networks, for different methods of reconstruction. We plot the distribution
of epidemic sizes (fraction of recovered individuals) at the end of SIR processes simulated on top of resampled (sample) and
reconstructed contact networks, for different values of the fraction $f$ of nodes removed, and for three reconstruction methods
described in the text (0, W, WT). In this case, as the population is not structured, methods W and WS on the one hand, WT
and WST on the other hand, are equivalent. The parameters of the SIR models are $\beta = 0.0004$ and $\beta/\mu = 100$. The case $f = 0$
corresponds to simulations using the whole data set, i.e., the reference case. For each value of $f$, 1,000 independent simulations
were performed.
FIG. 5. Accuracy of the different reconstruction methods. We perform SIR epidemic simulations for each case, for
different values of the fraction $f$ of missing nodes, for both sampled networks and networks reconstructed with the different
methods. We compare in each case, and as a function of $f$, the fraction of outbreaks that lead to a final fraction of recovered
individuals $r_\infty$ larger than 20% of the population (a, b, c), and the average size of these large outbreaks (d, e, f). The
dashed lines give the corresponding values for simulations performed on the complete data sets. The different methods are:
reconstruction conserving only the link density and the average weight of the resampled data (0); reconstruction conserving
only the link density and the distribution of weights of the resampled data (W); reconstruction preserving, in addition to the
W method, the group structure of the resampled data (WS); reconstruction conserving link density, distribution of weights and
distributions of contact times, of inter-contact times and of numbers of contacts per link measured in the resampled data (WT);
full method conserving all these properties (WST). We also plot as a function of $f$ the failure rate of the WST algorithm, i.e.,
the percentage of failed reconstructions (g, h, i). For the SFHH case, as the population is not structured into groups, methods
W and WS are equivalent, as well as methods WT and WST; moreover, reconstruction is always possible. The SIR parameters
are $\beta = 0.0004$ and $\beta/\mu = 1000$ (InVS) or $\beta/\mu = 100$ (Thiers13 and SFHH) and each point is averaged over 1,000 independent
simulations.
FIG. 6. SIS simulations for the InVS (workplace) case. We perform SIS epidemic simulations and report the phase
diagram of the SIS model for the original, resampled and reconstructed contact networks. Each panel shows the
prevalence $i_\infty$ in the stationary state of the SIS model, computed as described in Methods, as a function of $\beta$, for
several values of $\mu$. Simulations are performed in each case using either the complete data set (continuous lines), resampled
data (dashed lines) or reconstructed contact networks using the WST method (pluses). The fraction of excluded nodes in the
resampling is $f = 20\%$ for a, c, e, g and $f = 40\%$ for b, d, f, h.
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FIG. 7. SIR simulations on shuffled data. We compare of the outcome of SIR epidemic simulations performed on 
resampled and reconstructed contact networks, for shuffled data. We plot the distribution of epidemic sizes (fraction of 
recovered individuals) at the end of SIR processes simulated on top of either resampled (a, c, e) or reconstructed (b, d, f) 
contact networks, for different values of the fraction / of nodes removed. We use here the WST reconstruction method, and 
the data set considered consists in a CM-shuffled version (see Methods) of the original data, in which the shuffling procedure 
removes structural correlations of the contact network within each group. The parameters of the SIR models are (3 — 0.0004 
and (3/[i — 1000 (InVS) or (3/ji = 100 (Thiersl3 and SFHH). The case f — 0 corresponds to simulations using the whole data 
set, z.e., the reference (reshuffled data) case. For each value of /, 1,000 independent simulations were performed. 
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FIG. 8. SIR simulations for very large numbers of missing nodes. We simulate SIR processes on reconstructed contact 
networks for large values of the fraction / of removed nodes. We plot the distributions of epidemic sizes for simulations on 
reconstructed networks and on the whole data set (case f — 0), for large values of the fraction / of removed nodes. Here 
ft = 0.0004 and f3/ii = 1000 (InVS) or P/fi — 100 (Thiersl3 and SFHH) and 1,000 simulations were performed for each value 
of /. The distributions of epidemic sizes for simulations performed on resampled data sets are not shown since at these high 
values of /, almost no epidemics occur. 
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Supplementary Figure 9. Effect of sampling on contact network properties. Comparison of the distributions of structural 
(node degrees and link weights in the aggregated network of contacts) and temporal (contact durations, inter-contact times, 
number of contacts per link) properties of the contact networks, for different fractions / of removed nodes. For each value of 
/, the distributions are computed on a single realisation of the resampling. 
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Supplementary Figure 10. Effect of sampling on network density and on the similarity of contact matrices. (Left) 
Density p of the aggregated network of contacts as a function of the fraction / of nodes excluded. The shaded areas represent 
mean p =L s.e.m.. (Right) Median cosine similarities between the link density contact matrices (CML) of resampled and full 
data sets, as a function of /, for the structured populations (high school and workplace). Results are averaged, for each value 
of /, over 1,000 realisations for the density and over 100 realisations for the similarities. 
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Supplementary Figure 11. Effect of sampling and reconstruction on the average clustering coefficient (left column) 
and on the network transitivity (right column). The continuous lines show the evolution of clustering coefficient (left) 
and network transitivity when the fraction / of removed nodes increases. The full circles correspond to the same quantities for 
networks reconstructed using the WST method. Each point is averaged on 100 realisations. 









[Figure panels: matrices for f = 0%, 10%, 20% and 40%; rows and columns are the departments DISQ, DMCT, DSE, SRH and SFLE. A "-" marks a cell without a printed value.]

f = 0%
        DISQ    DMCT    DSE     SRH     SFLE
DISQ    0.533   0.094   0.073   0.029   0.172
DMCT    0.094   0.347   0.056   0.056   0.092
DSE     0.073   0.056   -       0.085   0.101
SRH     0.029   0.056   0.085   -       0.135
SFLE    0.172   0.092   0.101   0.135   0.333

f = 10%
        DISQ    DMCT    DSE     SRH     SFLE
DISQ    0.533   0.075   0.058   0.023   0.131
DMCT    0.075   0.348   0.046   0.046   0.073
DSE     0.058   0.046   -       0.068   0.082
SRH     0.023   0.046   0.068   -       0.108
SFLE    0.131   0.073   0.082   0.108   -

f = 20%
        DISQ    DMCT    DSE     SRH     SFLE
DISQ    0.531   0.06    0.045   0.019   0.115
DMCT    0.06    0.357   0.036   0.036   0.062
DSE     0.045   0.036   0.319   0.056   0.07
SRH     0.019   0.036   0.056   0.798   0.09
SFLE    0.115   0.062   0.07    0.09    0.373

f = 40%
        DISQ    DMCT    DSE     SRH     SFLE
DISQ    0.531   0.033   0.027   0.009   0.057
DMCT    0.033   0.337   0.021   0.021   0.033
DSE     0.027   0.021   -       0.029   0.037
SRH     0.009   0.021   0.029   -       0.047
SFLE    0.057   0.033   0.037   0.047   0.263


Supplementary Figure 12. Effect of sampling: link density contact matrices (InVS). Comparison of link density contact
matrices for the workplace, for different fractions f of excluded nodes, with the original one (f = 0). Each matrix element
AB gives the number of links between nodes of department A and nodes of department B in the contact network, normalised
by the maximum possible number of such links. For each value of f, each matrix element is an average over 100 realisations of
the sampling.

Supplementary Figure 13. Effect of sampling: link density contact matrices (Thiers13). Comparison of the link density
contact matrices for the high school, for different fractions f of excluded nodes, with the original one (f = 0). Each matrix
element AB gives the number of links between nodes of class A and nodes of class B in the contact network, normalised by the
maximum possible number of such links. For each value of f, each matrix element is an average over 100 realisations of the
sampling.

Supplementary Figure 14. Distributions of temporal characteristics for internal (within groups) and external
(between groups) contacts and links (InVS data). Symbols are for the original data, full lines for resampled data with
f = 20%, dotted lines for f = 40%. Legends give the average values for each distribution.

Supplementary Figure 15. Distributions of temporal characteristics for internal (within groups) and external
(between groups) contacts and links (Thiers13 data). Symbols are for the original data, full lines for resampled data
with f = 20%, dotted lines for f = 40%. Legends give the average values for each distribution.

Supplementary Figure 16. Properties of the reconstructed contact network. Same as Supplementary Figure 9 but for the reconstructed
networks: distributions of structural (degrees and weights in the aggregated contact network) and temporal (contact times,
inter-contact times, number of contacts per link) properties of the surrogate contact networks, for different fractions f of nodes
excluded. For each value of f, the distributions are computed on a single reconstructed network.

Supplementary Figure 17. Properties of the reconstructed contact network: link density contact matrices (InVS).
Comparison of link density contact matrices for the reconstructed network of the workplace data, for different values of the
fraction f of excluded nodes, with the original one (f = 0). For each value of f, each matrix element is an average over 100
realisations of the sampling.

Supplementary Figure 18. Properties of the reconstructed contact network: link density contact matrices
(Thiers13). Comparison of link density contact matrices for the reconstructed network of the high school data, for different
values of the fraction f of excluded nodes, with the original one (f = 0). For each value of f, each matrix element is an
average over 100 realisations of the sampling.

Supplementary Figure 19. Similarity of contact matrices for different reconstruction methods. Median cosine
similarity between the contact matrices (link density and contact time density) of the reconstructed network
and the original contact matrices, as a function of the fraction f of removed nodes. For each value of f, the median is
computed over 100 realisations of the reconstruction.

Supplementary Figure 20. Properties of the reconstructed contact network: time density contact matrices (InVS).
Comparison of the contact time density contact matrices for the reconstructed network of the workplace data, for different
fractions f of excluded nodes, with the original one (f = 0). Each matrix element AB gives the average time spent in contact
between a node of department A and a node of department B. For each value of f, each matrix element is an average over 100
realisations of the sampling.

Supplementary Figure 21. Properties of the reconstructed contact network: time density contact matrices
(Thiers13). Contact time density contact matrices for the reconstructed network of the high school data, for different
fractions f of nodes excluded. Each matrix element AB gives the average time spent in contact between a node of class A and
a node of class B. For each value of f, each matrix element is an average over 100 realisations of the sampling.

Supplementary Figure 22. Phase diagram of the SIS model for original, resampled and reconstructed contact
networks (Thiers13 data set). Each panel shows the stationary value $i_\infty$ of the prevalence of the
SIS model, computed as described in the Methods section, as a function of $\beta$, for several values of $\mu$. Here we consider the
example of the Thiers13 data set. The epidemic threshold corresponds to the transition between $i_\infty = 0$ and $i_\infty > 0$. The
prevalence curves are computed in each case using either the whole data set (continuous lines), resampled data (dashed lines)
or reconstructed contact networks (pluses). The fraction of excluded nodes in the resampling is f = 20% for the left column
and f = 40% for the right column.

Supplementary Figure 23. Phase diagram of the SIS model for original, resampled and reconstructed contact
networks (SFHH data set). Same as Supplementary Figure 22 for the SFHH (conference) data set.

Supplementary Figure 24. Outcome of SIR epidemic simulations on resampled and reconstructed networks for
different parameter values. Distribution of epidemic sizes (fraction of recovered individuals) at the end of SIR processes
simulated on top of either resampled (left column) or reconstructed (right) contact networks, using the WST method, for
different values of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.004$ and $\beta/\mu = 1000$ (InVS)
or $\beta/\mu = 100$ (Thiers13 and SFHH). The case f = 0 corresponds to simulations using the whole data set, i.e., the reference
case. For each value of f, 1,000 independent simulations were performed.

Supplementary Figure 25. Outcome of SIR epidemic simulations on resampled and reconstructed networks for
different parameter values. Distribution of epidemic sizes (fraction of recovered individuals) at the end of SIR processes
simulated on top of either resampled (left column) or reconstructed (right) contact networks, using the WST method, for
different values of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.04$ and $\beta/\mu = 1000$ (InVS) or
$\beta/\mu = 100$ (Thiers13 and SFHH). The case f = 0 corresponds to simulations using the whole data set, i.e., the reference case.
For each value of f, 1,000 independent simulations were performed.

Supplementary Figure 26. Outcome of SIR epidemic simulations on resampled and reconstructed networks for
different parameter values. Distribution of epidemic sizes (fraction of recovered individuals) at the end of SIR processes
simulated on top of either resampled (left column) or reconstructed (right) contact networks, using the WST method, for
different values of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.04$ and $\beta/\mu = 4000$ (InVS) or
$\beta/\mu = 400$ (Thiers13 and SFHH). The case f = 0 corresponds to simulations using the whole data set, i.e., the reference case.
For each value of f, 1,000 independent simulations were performed.








[Figure panels: "Sampled network" (left column) and "Reconstructed network" (right column); horizontal axes: "Epidemic size".]


Supplementary Figure 27. Outcome of SIR epidemic simulations on resampled and reconstructed networks for
different parameter values. Distribution of epidemic sizes (fraction of recovered individuals) at the end of SIR processes
simulated on top of either resampled (left column) or reconstructed (right) contact networks, using the WST method, for
different values of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.0004$ and $\beta/\mu = 500$ (InVS)
or $\beta/\mu = 50$ (Thiers13 and SFHH). The case f = 0 corresponds to simulations using the whole data set, i.e., the reference case.
For each value of f, 1,000 independent simulations were performed.

Supplementary Figure 28. WST method with constrained transitivity. Comparison of the outcome of SIR epidemic
simulations performed on resampled and reconstructed contact networks. Distribution of epidemic sizes (fraction of
recovered individuals) at the end of SIR processes simulated on top of either resampled (left column) or reconstructed (right)
contact networks, for different values of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.0004$
and $\beta/\mu = 1000$ (InVS) or $\beta/\mu = 100$ (Thiers13 and SFHH). The case f = 0 corresponds to simulations using the whole data
set, i.e., the reference case. For each value of f, 1,000 independent simulations were performed.

Supplementary Figure 29. Method WST on link-shuffled network. Comparison of the outcome of SIR epidemic
simulations performed on resampled and reconstructed contact networks. Distribution of epidemic sizes (fraction of
recovered individuals) at the end of SIR processes simulated on top of either resampled (left column) or reconstructed (right)
contact networks, for different values of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.0004$
and $\beta/\mu = 1000$ (InVS) or $\beta/\mu = 100$ (Thiers13 and SFHH). The case f = 0 corresponds to simulations using the whole data
set, i.e., the reference case. For each value of f, 1,000 independent simulations were performed.

Supplementary Figure 30. Method WST on time-shuffled network. Comparison of the outcome of SIR epidemic
simulations performed on resampled and reconstructed contact networks. Distribution of epidemic sizes (fraction of
recovered individuals) at the end of SIR processes simulated on top of either resampled (left column) or reconstructed (right)
contact networks, for different values of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.0004$
and $\beta/\mu = 1000$ (InVS) or $\beta/\mu = 100$ (Thiers13 and SFHH). The case f = 0 corresponds to simulations using the whole data
set, i.e., the reference case. For each value of f, 1,000 independent simulations were performed.

Supplementary Figure 31. Uncertainty on the sampling fraction. Comparison of the outcome of SIR epidemic
simulations performed on contact networks where 25% of nodes were removed, and reconstructed with different
values of the assumed sampling fraction. Distribution of epidemic sizes (fraction of recovered individuals) at the end of
SIR processes simulated on top of either resampled (left column) or reconstructed (right) contact networks, for different values
of the fraction f of nodes removed. The parameters of the SIR models are $\beta = 0.0004$ and $\beta/\mu = 1000$ (InVS) or $\beta/\mu = 100$
(Thiers13 and SFHH). The case "Original data" corresponds to simulations using the whole data set, i.e., the reference case.
For each value of f, 1,000 independent simulations were performed.







SUPPLEMENTARY NOTES 

Supplementary note 1: Effect of sampling on the temporal network of contacts 

As described in the main text, we consider temporally resolved networks of contacts $T$ in a population $V$ of N
individuals and we perform a resampling experiment by selecting a subpopulation $\tilde{V}$ of these individuals, of size
$\tilde{N} = (1 - f)N$. We assume that only the contacts occurring among the subpopulation $\tilde{V}$ are known and we compare
the properties of the corresponding resampled subnetwork $\tilde{T}$ with those of the original network.

Supplementary Figure 9 shows how population sampling affects several statistical properties of the contact networks.
On the one hand, the degree distribution of the aggregated network of contacts systematically shifts towards smaller
degree values. This is expected, as each remaining node has in the resampled network a degree which is at most its
degree in the original network, and is strictly smaller if some of its neighbours are not part of the resampled population.
On the other hand, the statistical distributions of several quantities of interest are not affected by sampling: this
is the case for the quantities attached either to single contacts or to single links, namely contact and inter-contact
durations, number of contacts per link and link weights (the weight of a link is given by the total duration of the
contacts between the two corresponding nodes).

Moreover, as shown in Supplementary Figure 10, the density of the aggregated network, i.e., the ratio between the
number of links and the number of possible links, is on average conserved by the random resampling procedure. It
varies however for different realisations of the resampling, and the corresponding variance increases with the fraction
f of excluded nodes.
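
As an illustration, the resampling experiment and the density measurement can be sketched as follows. This is a minimal sketch and not the authors' code: the contact format (one `(t, i, j)` record per 20 s interval), the toy data and the function names `resample`, `link_weights` and `density` are assumptions made for the example.

```python
import random

def resample(contacts, nodes, f, rng):
    """Keep only the contacts among a random subpopulation of size round((1 - f) * N)."""
    n_kept = round((1 - f) * len(nodes))
    kept = set(rng.sample(sorted(nodes), n_kept))
    sub = [(t, i, j) for (t, i, j) in contacts if i in kept and j in kept]
    return kept, sub

def link_weights(contacts, dt=20):
    """Aggregate contact records into link weights (total contact time in seconds)."""
    weights = {}
    for _, i, j in contacts:
        key = (min(i, j), max(i, j))
        weights[key] = weights.get(key, 0) + dt
    return weights

def density(n_nodes, n_links):
    """Number of links divided by the number of possible links."""
    return n_links / (n_nodes * (n_nodes - 1) / 2)

# Hypothetical toy data: 200 contact records among 10 nodes.
nodes = list(range(10))
rng = random.Random(0)
contacts = [(20 * k, *rng.sample(nodes, 2)) for k in range(200)]

kept, sub = resample(contacts, nodes, f=0.4, rng=rng)
rho_full = density(len(nodes), len(link_weights(contacts)))
rho_sub = density(len(kept), len(link_weights(sub)))
```

Repeating the call to `resample` with different seeds gives the realisation-to-realisation fluctuations of the density mentioned above.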

Supplementary Figure 11 shows how the average clustering coefficient of the aggregated network varies with the
resampling: notably, it remains high and close to its original value until large values of f are reached. The transitivity
of the network, defined as three times the number of triangles divided by the number of connected triplets (connected
subgraphs of 3 nodes and 2 edges), is even less affected than the clustering coefficient by the resampling procedure.
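
The two quantities can be computed from an adjacency structure as in the sketch below. The function name and the toy graph are hypothetical, and nodes of degree smaller than 2 are given a local clustering of 0, which is an assumption of this example.

```python
from itertools import combinations

def transitivity_and_clustering(adj):
    """adj maps each node to the set of its neighbours (undirected, no self-loops)."""
    closed = 0     # closed triplets summed over centre nodes, i.e. 3 x number of triangles
    triplets = 0   # connected triplets (2-edge paths), counted by their centre node
    local = []
    for node, neigh in adj.items():
        k = len(neigh)
        pairs = k * (k - 1) // 2
        links = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
        triplets += pairs
        closed += links
        local.append(links / pairs if pairs else 0.0)
    transitivity = closed / triplets if triplets else 0.0
    return transitivity, sum(local) / len(local)

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
t, c = transitivity_and_clustering(adj)
```

On this toy graph there is one triangle and five connected triplets, so the transitivity is 3/5, while the average clustering coefficient is (1 + 1 + 1/3 + 0)/4.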

In the case of structured populations, Supplementary Figures 12 and 13 show that the stability of the resampled
network's density holds at the more detailed level of the contact matrices of link densities. In such matrices, the
element (i, j) is given by the number of links between individuals of groups i and j, normalised by the total number
of possible links between these two groups (if $n_i$ denotes the number of individuals in group i, the number of possible
links is equal to $n_i n_j/2$ for $i \neq j$ and to $n_i(n_i - 1)/2$ for $i = j$). These figures clearly illustrate how the diagonal and
block-diagonal structures are preserved, and Supplementary Figure 10 gives a quantitative assessment of this stability
by showing that the cosine similarity between the contact matrices of the resampled and original aggregated contact
networks remains high even when a large fraction of the nodes is excluded.
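
For reference, a cosine similarity between two contact matrices can be computed as sketched below. Treating each matrix as a flattened vector is an assumption of this example, and the 2 x 2 matrices are hypothetical values, not data from the study.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two matrices of equal shape, flattened to vectors."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0

# Hypothetical link density matrices for two groups (original and resampled).
original = [[0.50, 0.10], [0.10, 0.35]]
resampled = [[0.52, 0.08], [0.08, 0.33]]
sim = cosine_similarity(original, resampled)
```

The measure is insensitive to a global rescaling of one matrix, so it captures the preservation of the diagonal and block-diagonal structure rather than of the absolute densities.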

We moreover illustrate in Supplementary Figures 14 and 15 the difference in statistical properties of contacts and
links within and between groups, still for structured populations:

• the distributions of contact durations are indistinguishable; 

• the distribution of link weights (aggregated contact durations) is broader for links between individuals belonging 
to the same group than for links joining individuals of different groups; 

• this is due to the difference in the distributions of numbers of contacts per link, which is broader for links within 
groups than for links between groups; 

• the distributions of inter-contact durations differ also slightly, with smaller averages for within-group links. 

Most importantly, all these properties and distributions remain stable under resampling, showing that reliable
information on the distributions of contact and inter-contact durations, aggregated contact durations and numbers of contacts
per link can be obtained from the resampled data, including the statistical differences between links joining members
of different groups and links between two individuals of the same group.


Supplementary note 2: Properties of the reconstructed contact networks 

As described in the main text and in particular in the Methods section, we construct a surrogate set of contacts
concerning the fN individuals excluded by the resampling. We compare here the properties of the resulting contact
networks (obtained by merging the resampled contact network $\tilde{T}$ and the surrogate set of contacts) and of the original
contact network, $T$.

Supplementary Figure 16 shows that the degree distribution, which is not constrained by the reconstruction procedure,
deviates from the original distribution. On the other hand, the distributions of contact durations, inter-contact
durations, number of contacts per link and link weights are preserved. Moreover, the link density contact matrices of
the reconstructed networks (Supplementary Figures 17 and 18) share a high similarity with the original contact matrices,
even for high fractions of nodes excluded (Supplementary Figure 19).

For completeness, we also compute the contact matrices in contact time density (CMT), in which each element
(i, j) is given by the total time in contact between individuals of groups i and j, normalised by the total number
of possible links between these two groups: it gives the average time spent in contact by two random individuals of
groups i and j.


Supplementary Figures 19, 20 and 21 show that the structure of these matrices is well recovered by
the reconstruction methods, with high similarity with the original matrices.


Supplementary note 3: Phase diagram of the SIS model for the conference and high school data sets 


We observe for the high school and the conference the same effect on the phase diagram of the SIS model as in the 
workplace: sampling leads to a shift of the epidemic threshold to higher values and thus to an underestimation of the 
epidemic risk. The phase diagram and the epidemic threshold are estimated more accurately by using reconstructed 
networks, thus giving a better evaluation of the epidemic risk (Supplementary Figures 22 & 23). 


Supplementary note 4: Sensitivity analysis 


In the main text, we have considered values of the SIR model parameters leading to a non-negligible epidemic
risk and a value of $\beta$ corresponding to slow processes. We consider here several other values of the parameters,
corresponding either to faster processes (Supplementary Figures 24 to 26) or to smaller epidemic risk (Supplementary
Figure 27). In all cases, simulations performed on the resampled contact networks lead to a strong underestimation
of the epidemic sizes, with distributions shifting to smaller values as f increases, while the use of reconstructed data
sets leads to a better estimation and generally speaking a slight overestimation of the epidemic risk.


SUPPLEMENTARY METHODS 


Detailed alternative reconstruction methods 

We give here details on the alternative reconstruction methods mentioned in the main text, which use less
information than the WST method. In each case we consider the same setup as for the complete method: a population $V$ of
N individuals (the nodes of the contact network), potentially organised in groups, for which we know all the contacts
taking place among a subpopulation $\tilde{V}$ of size $\tilde{N} = (1 - f)N$. For the remaining $n = N - \tilde{N} = fN$ individuals, no
contact information is available, but we know to which group they belong. We also have access to the overall activity
timeline, i.e., to the successive intervals during which contacts can happen (daytimes) and those during which they are excluded (nights and
weekends). The alternative reconstruction methods are the following:

0: We perform the reconstruction using only the network density and the average link weight, both measured in the
resampled network $\tilde{T}$. The algorithm goes as follows:

1. we measure in the resampled data:

• the density $\rho$ of links in the time-aggregated network;

• the average link weight $\langle w \rangle_s$ (the weight of a link is defined as the total contact time between the two
linked nodes);

2. we compute the number of links e that must be added to keep the network density constant when we add
the n excluded nodes;

3. we construct e links according to the following procedure:

• a node i is randomly chosen from the set $V \setminus \tilde{V}$ of excluded nodes;

• a node j is randomly chosen from the set $V \setminus \{i\}$ of all other nodes;

• we compute $n_{ij} = \langle w \rangle_s / \Delta t$, where $\Delta t = 20$ s is the temporal resolution of the data set, and we randomly
choose $n_{ij}$ time windows of length $\Delta t$ within the activity windows defined by the activity timeline as
contact events between i and j.
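
The steps of method 0 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: e is taken as the target number of links at density $\rho$ in the N-node network minus the number of links already present in the resampled data, $n_{ij}$ is rounded to an integer, a pair that already carries a surrogate link is redrawn, and all names and toy values are hypothetical.

```python
import random

def n_links_to_add(rho, n_total, n_kept, n_links_sampled):
    """Links to add so that the N-node network keeps density rho (assumed reading of step 2)."""
    target = round(rho * n_total * (n_total - 1) / 2)
    # Only pairs involving at least one excluded node can receive a new link.
    free_pairs = n_total * (n_total - 1) // 2 - n_kept * (n_kept - 1) // 2
    return max(0, min(target - n_links_sampled, free_pairs))

def reconstruct_method_0(kept, excluded, n_links_sampled, rho, mean_weight,
                         activity_windows, rng, dt=20):
    """Return surrogate contacts as {(i, j): sorted start times of dt-long windows}."""
    everyone = sorted(kept | excluded)
    # Admissible start times of a dt-long window inside the activity periods.
    slots = [t for start, end in activity_windows
             for t in range(start, end - dt + 1, dt)]
    n_ij = min(len(slots), max(1, round(mean_weight / dt)))
    n_new = n_links_to_add(rho, len(everyone), len(kept), n_links_sampled)
    surrogate = {}
    while len(surrogate) < n_new:
        i = rng.choice(sorted(excluded))                  # node among the excluded ones
        j = rng.choice([v for v in everyone if v != i])   # any other node
        key = (min(i, j), max(i, j))
        if key in surrogate:   # assumption: redraw if this pair is already linked
            continue
        surrogate[key] = sorted(rng.sample(slots, n_ij))
    return surrogate

# Hypothetical example: 6 sampled nodes with 6 links (rho = 0.4), 4 excluded nodes.
kept, excluded = set(range(6)), set(range(6, 10))
windows = [(0, 3600), (86400, 90000)]
n_new = n_links_to_add(0.4, 10, 6, 6)
surrogate = reconstruct_method_0(kept, excluded, 6, 0.4, 60, windows, random.Random(1))
```

In this example the full network should hold 0.4 x 45 = 18 links, so 12 surrogate links are created, each with 60/20 = 3 contact windows.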









W: We perform the reconstruction using only the network density and the distribution of link weights, both measured
in the resampled network $\tilde{T}$. The algorithm goes as follows:

1. we measure in the resampled data:

• the density $\rho$ of links in the time-aggregated network;

• the list $\{w\}$ of link weights (the weight of a link is defined as the total contact time between the two
linked nodes);

2. we compute the number of links e that must be added to keep the network density constant when we add
the n excluded nodes;

3. we construct e links according to the following procedure:

• a node i is randomly chosen from the set $V \setminus \tilde{V}$ of excluded nodes;

• a node j is randomly chosen from the set $V \setminus \{i\}$ of all other nodes;

• from $\{w\}$, we draw the weight $w_{ij}$ of the link ij;

• we compute $n_{ij} = w_{ij} / \Delta t$, where $\Delta t = 20$ s is the temporal resolution of the data set, and we randomly
choose $n_{ij}$ time windows of length $\Delta t$ within the activity windows defined by the activity timeline as
contact events between i and j.

WS: We perform the reconstruction using the network density, the distributions of link weights for internal (within
groups) and external (between groups) links, and the structure of the aggregated network given by the link
density contact matrix, all measured in the resampled network $\tilde{T}$. The algorithm goes as follows:

1. we measure in the resampled data:

• the density $\rho$ of links in the time-aggregated network;

• a row-normalised contact matrix $C$, in which the element $C_{AB}$ gives the probability for a node in group
A to have a link to a node of group B;

• the lists $\{w\}_{int}$ and $\{w\}_{ext}$ of link weights for respectively internal and external links (internal links
are links between nodes that belong to the same group, external links are links between nodes from
different groups);

2. we compute the number of links e that must be added to keep the network density constant when we add
the n excluded nodes;

3. we construct e links according to the following procedure:

• a node i is randomly chosen from the set $V \setminus \tilde{V}$ of excluded nodes;

• knowing the group A that i belongs to, we extract at random a target group B with probability given
by $C_{AB}$;

• we draw a target node j at random from B (if B = A, we check that $j \neq i$);

• depending on whether nodes i and j belong to the same group or not, we draw from $\{w\}_{int}$ or $\{w\}_{ext}$
the weight $w_{ij}$ of the link ij;

• as for the W method, we extract at random $w_{ij} / \Delta t$ contact events of length $\Delta t = 20$ s within the
activity timeline.

WT: We perform the reconstruction using the network density, the distribution of link weights and the temporal
structure of the contacts given by the distributions of contact durations, inter-contact durations, number of
contacts per link and initial waiting times before the first contact, all measured in the resampled network $\tilde{T}$.
The algorithm goes as follows:

1. we compute from the activity timeline the time $T_u$, the total duration of the periods during which contacts
can occur;

2. we measure in the resampled data:

• the density $\rho$ of links in the time-aggregated network;

• the list $\{\tau_c\}$ of contact durations;

• the list $\{\tau_{ic}\}$ of inter-contact durations;

• the list $\{p\}$ of numbers of contacts per link;

• the list $\{t_0\}$ of initial waiting times before the first contact for each link;

3. we compute the number of links e that must be added to keep the network density constant when we add
the n excluded nodes;

4. we construct e links according to the following procedure:

(a) a node i is randomly chosen from the set $V \setminus \tilde{V}$ of excluded nodes;

(b) a node j is randomly chosen from the set $V \setminus \{i\}$ of all other nodes;

(c) we draw from $\{p\}$ the number of contact events p taking place over the link ij;

(d) from $\{t_0\}$, we draw the initial waiting time $t_0$ before the first contact;

(e) from $\{\tau_c\}$, we draw p contact durations $\tau_c^k$, $k = 1, \dots, p$;

(f) from $\{\tau_{ic}\}$, we draw p - 1 inter-contact durations $\tau_{ic}^m$, $m = 1, \dots, p - 1$;

(g) while $t_0 + \sum_k \tau_c^k + \sum_m \tau_{ic}^m > T_u$, we repeat steps (c) to (f);

(h) from $t_0$, the $\tau_c^k$ and the $\tau_{ic}^m$, we build the contact timeline of the link ij;

(i) finally, we insert in the contact timeline the breaks defined by the activity timeline.
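
Steps (c) to (i) of the WT method, for a single link, can be sketched as follows. This is a hedged sketch rather than the authors' code: the function names and sample lists are hypothetical, the timeline is first built in "active" time (breaks removed), and a contact that straddles a break is split in two, which is an assumption since the text does not specify this case.

```python
import random

def draw_link_timeline(n_contacts, waits, durations, gaps, t_u, rng):
    """Steps (c)-(h): contact intervals [(start, end)] in active time (breaks removed).

    Assumes at least one combination of drawn values fits within t_u.
    """
    while True:
        p = rng.choice(n_contacts)
        t0 = rng.choice(waits)
        taus = [rng.choice(durations) for _ in range(p)]
        inter = [rng.choice(gaps) for _ in range(p - 1)]
        if t0 + sum(taus) + sum(inter) <= t_u:   # otherwise redraw, as in step (g)
            break
    timeline, t = [], t0
    for k, tau in enumerate(taus):
        timeline.append((t, t + tau))
        t += tau
        if k < p - 1:
            t += inter[k]
    return timeline

def insert_breaks(timeline, activity_windows):
    """Step (i): map active-time intervals to real time, splitting them at breaks."""
    real = []
    for start, end in timeline:
        offset = 0   # active time elapsed before the current activity window
        for w_start, w_end in activity_windows:
            length = w_end - w_start
            lo, hi = max(start, offset), min(end, offset + length)
            if lo < hi:
                real.append((w_start + lo - offset, w_start + hi - offset))
            offset += length
    return real

# Hypothetical measured lists (seconds) and a two-day activity timeline.
windows = [(0, 3600), (86400, 90000)]
t_u = sum(w_end - w_start for w_start, w_end in windows)
timeline = draw_link_timeline([1, 2, 3], [0, 100, 500], [20, 40, 60],
                              [20, 200, 1000], t_u, random.Random(2))
real = insert_breaks(timeline, windows)
split = insert_breaks([(3500, 3700)], windows)
```

Here `split` illustrates the handling of breaks: a contact spanning active times 3500 to 3700 ends up as one interval at the end of the first activity window and one at the start of the second.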


Reconstruction with fixed transitivity 

In order to constrain the transitivity to its value measured in the resampled data, we add to the WST algorithm
the following elements:

1. we measure in the resampled data the transitivity $\sigma_0$ of the time-aggregated network;

2. for the construction of each link of a node i:

• we calculate the current transitivity $\sigma$ of the network;

• we list the potential targets j in two lists $L_\Delta$ and $L_\Lambda$, depending on whether the creation of a link between
i and j would close a triangle or not;

• - if $\sigma < \sigma_0$, we draw a target node j at random from $L_\Delta$ such that i and j are not linked;

- else if $\sigma > \sigma_0$, we draw a target node j at random from $L_\Lambda$ such that i and j are not linked.

We show in Supplementary Figure 28 the outcome of simulations performed on reconstructed data sets using this
modified algorithm.
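
The target selection rule can be sketched as follows. The function name and toy graph are hypothetical, and two behaviours are assumptions of this example because the text does not specify them: when the current transitivity equals the measured one, or when the preferred list is empty, the target is drawn among all admissible nodes.

```python
import random

def choose_target(i, adj, sigma, sigma_0, rng):
    """Pick a target j for a new link of node i, steering the transitivity towards sigma_0."""
    # Admissible targets: nodes other than i that are not yet linked to i.
    candidates = [j for j in adj if j != i and j not in adj[i]]
    # A new link i-j closes a triangle when i and j share at least one neighbour.
    closing = [j for j in candidates if adj[i] & adj[j]]
    non_closing = [j for j in candidates if not adj[i] & adj[j]]
    if sigma < sigma_0:
        preferred = closing
    elif sigma > sigma_0:
        preferred = non_closing
    else:
        preferred = candidates            # assumption: no preference at equality
    pool = preferred or candidates        # assumption: fall back if the list is empty
    return rng.choice(pool) if pool else None

# Toy graph: path 0-1-2 and an isolated node 3; we add a link from node 0.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
rng = random.Random(3)
j_close = choose_target(0, adj, sigma=0.1, sigma_0=0.3, rng=rng)
j_open = choose_target(0, adj, sigma=0.5, sigma_0=0.3, rng=rng)
```

In the toy graph, node 2 shares neighbour 1 with node 0, so it is chosen when the transitivity is too low, whereas the isolated node 3 is chosen when it is too high.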



